PREFACE


This revision was made desirable for two reasons: the changing emphasis 
in the needs for statistical methods of different kinds and the rapid develop- 
ment of new, useful methods. 

The importance of statistical tests of hypotheses and of statistical infer- 
ences has continued to grow in research in the social sciences. At the same 
time, it is the author's belief that this does not necessarily diminish the impor- 
tance of descriptive statistics, which will always have its place. It is desir- 
able, then, to conserve what is useful of the latter, while expanding our atten- 
tion to the former. Within the limits of a single volume, the author has 
attempted to maintain an appropriate balance at a moderate level of statis- 
tical instruction that does not presuppose much in the way of a mathematical 
foundation. 

In attempting to maintain a balance, the author has retained the previous 
preponderance of attention to descriptive statistics. While the great importance
of statistical significance cannot be denied, research in the social
sciences is not confined to studies in which results are at the margin of signifi- 
cance. Also, the generation of scientific ideas, which after all is the most 
important requisite for scientific progress, does not depend particularly upon 
decisions concerning chance alternatives. The idea-generating step is much
more likely to depend upon awareness of statistical models provided by 
descriptive statistics than of those of sampling statistics. The value of the 
latter comes in at the end of an investigation. Tests of statistical signifi- 
cance serve an evaluative function rather than a creative one. 

The new material in this edition in the area of hypothesis testing and statis- 
tical inference includes several things. Among new applications of chi square 
are Bartlett's test of homogeneity of variance and combined tests of signifi- 
cance. Many of the new nonparametric, or distribution-free, tests of signifi- 
cance now available are included. Additional applications of analysis of 
variance are described, including the intraclass correlation. A more com- 
plete and coherent account of basic theory of hypothesis testing is presented 
at a simple level. New tables are provided to assist in connection with the 
added tests of significance, including exact probabilities in connection with 
chi square for very small samples. The discriminant function is introduced 
in connection with multiple-correlation methods. 

The most thoroughly rewritten and reorganized part of the volume is in
connection with the old Chaps. 9 to 11, which are now presented in four chap- 
ters, 9 to 12. Rearrangement of material in Chap. 9 puts first those things 
that are most likely to be included in an introductory course. Chapters 1 
through 8 and the first part of 9 thus probably serve better than before as 
material for a first course in statistics. Chapter 12, on test scales and norms 
in the old edition, is now the final chapter, thus improving the continuity in 
the central portion of the volume from which it was removed. 

In response to many requests, answers have been provided to all computational
problems in the exercises. Exercises have been revised in keeping
with changes in the text. 

Eliminations have been made, in order to make room for the new material 
and in an effort to effect a net shortening of the volume. The final chapter 
of the second edition on scaling methods has been eliminated entirely. Some 
material in the chapters on reliability and validity of measurements has also 
been eliminated. In both instances this has been done in view of the fact 
that the same subject matter has been treated at much greater length in the 
second edition of the author's Psychometric Methods. Other statistics of less 
popular use have been omitted here and there, but some that might have been 
dropped have been retained because they appear nowhere else in texts that 
are in common use. 

I am most indebted for suggestions to Dr. William W. Grings and Dr. William
B. Michael, who have taught with this text, and also to Dr. Harvey
Dingman and Mr. James W. Frick, who assisted in preparation of material for 
the revision as well as making suggestions. I am indebted to a number of 
writers who have generously granted permission for the use of new material, as 
well as to those whose material has been carried over from previous editions. 


J. P. GUILFORD 

CHAPTER 1


INTRODUCTION FOR STUDENTS 


Why the Student Needs Statistics. Most seasoned workers in psychology 
or in education usually take the statistical methods for granted as an essential 
part of their routine, some more so and some less. The initiate may at first 
react to statistics as a frightful bogie whose mysteries loom forbiddingly 
before him, and he is likely to ask, "What is the good of them, anyway?"
This is particularly true of one who feels he has always had trouble with num- 
bers. Students who enter a first course in statistical methods in psychology 
or education, and probably in all related social sciences, range all the way 
from those who find mathematics in general easy and to their liking, to those 
at the other extreme who say they have difficulty in adding two and two. 
Somehow, all these must acquire what they can of a subject for which they 
are so unequally prepared. 

Probably no other subject demonstrates so clearly that there are several 
kinds of intelligence. No less a person intellectually than Charles Darwin 
had trouble with statistics, as he is said to have frankly admitted. His 
almost equally illustrious cousin, Sir Francis Galton, who is believed to have 
had an IQ of about 200, and who had so much to do with introducing statistics 
into psychology, had to turn some of his mathematical problems over to 
others for aid. 

There are different ways of understanding the same things. One student 
will grasp the new ideas offered by statistics in the way that a mathematician 
would understand them; another will appreciate the logical rules of thinking 
and the concepts provided as aids in thinking; still others will master rule-of- 
thumb operations and be able to carry through computations with a minimum 
grasp of what they are all about. Learning without achieving insights and
appreciations of the inner nature of things is learning without full motivation 
and enthusiasm and is not very satisfying. The average student will neces- 
sarily have to be content with levels of insight that fall short of those of the 
mathematician, remembering that even mathematicians have not by any 
means exhausted the meanings and ramifications of statistical ideas. On the 
other hand, each student should strive to inject as much meaning and signifi- 
cance, in his own way, as he can. The proper use and optimal use of statisti- 
cal methods and statistical thinking require certain minimal achievement of 
understanding. Clerks can be taught to carry out many of the computational
procedures; it is not the primary purpose of this book or of those who teach 
with it to develop computational clerks. The purpose is to develop those 
who could be supervisors of clerks. 

To be more specific, there are four simple, undeniable reasons why the 
student who takes a required course in statistics must develop some mastery 
of that subject. 

1. He must be able to read professional literature. There is no questioning 
the fact that learning in any field comes largely through reading. The stu- 
dent never finishes the extension of his skill in the art of reading, if he is a 
successful student. In any specialized field, reading is largely a matter of 
enlarging vocabulary. One cannot read much of the literature in any special- 
ized field in the social sciences, particularly psychology and education, with- 
out encountering statistical symbols, concepts, and ideas on every hand. One 
could do as the young child does when he tackles reading matter that is somewhat
beyond him, “skip over the hard places.” But this is hardly excusable
in the adult who is reading material that should not be beyond him and in 
which the “hard places” may, in fact, contain the crucial parts of the content. 
One who dodges such parts is likely to be dependent upon the conclusions of 
others for his own conclusions and opinions. This is hardly independent 
judgment or a symptom of mature scholarship. It is not necessary for every 
psychologist to be able to sail through the “heavier” mathematical contribu- 
tions of the specialist in statistics. It is severely limiting, however, for a 
person not to be able to read intelligently the average research paper in his 
field with some appreciation as to whether sound conclusions have been 
reached. The chances are that this appreciation will require familiarity 
with the basic statistical ideas. 

2. He must master techniques needed in advanced courses. Whether the 
advanced course is a laboratory course or a practicum, there are usually cer- 
tain incidental techniques that are commonly used in the operations involved. 
In the laboratory course, results cannot be treated or reports written without
at least minimal statistical operations. A field survey or the checking of a 
report also involves inevitable statistical steps. 

3. Statistics is an essential part of professional training. The trained
psychologist or educator likes to think of himself as a professional person. 
To some extent, statistical logic, statistical thinking, and statistical operations
are a necessary part of either profession. To the extent that he uses in his 
practice the common technical instruments, such as tests, the psychologist or 
educator will depend upon statistical background in their administration and 
in the interpretation of the results. Using tests without knowledge of the 
statistical reasoning upon which they depend is like the medical diagnostician's
using clinical tests without a knowledge of physiology and pathology.

4. Statistics are everywhere basic to research activities. To the extent that
either psychologist or educator intends to keep alive his research interests and
research activities, he will necessarily lean upon his knowledge and skills in 
statistical methods. The relation of statistics to research will be elaborated 
upon in the next paragraphs. Here it is merely urged that in any professional 
fields where there are still so many unknowns as in psychology and education, 
the advancement of those professions and of the competence of their members 
depends to a high degree upon the continued research attitude and research 
efforts of those members. 

Why Statistics Are Important in Research. Briefly, the advantages of 
statistical thinking and operations in research are as follows: 

1. They permit the most exact kind of description. When all is said and done, 
the goal of science is description of phenomena, description so complete and 
so accurate that it is useful to anyone who can understand it when he reads 
the symbols in terms of which those phenomena are described. Mathematics 
and statistics are a part of our descriptive language, an outgrowth of our 
verbal symbols, peculiarly adapted to the efficient kind of description that the 
scientist demands.

2. They force us to be definite and exact in our procedures and in our thinking. 
The writer once heard a prominent psychologist defend his rather vague 
conclusions by saying that he would rather be vague and right than to be 
definite and wrong. But the alternatives are not to be either “vague and 
right” or “definite and wrong.” One can also be definite and right, and it is 
the writer’s contention that the odds for being right are overwhelmingly on 
the “definite” side of the matter.

3. Statistics enable us to summarize our results in meaningful and convenient
form. Masses of observations taken by themselves are bewildering and 
almost meaningless. Before we can see the forest as well as the trees, order 
must be given to the data. Statistics provides an unrivaled device for bring- 
ing order out of chaos, of seeing the general picture in one's results. 

4. They enable us to draw general conclusions, and the process of extracting 
conclusions is carried out according to accepted rules. Furthermore, by 
means of statistical steps, we can say about how much faith should be placed 
in any conclusion and about how far we may extend our generalization. 

5. They enable us to make predictions of “how much” of a thing will happen
under conditions we know and have measured. For example, we can predict 
the probable mark a freshman will earn in college algebra if we know his score 
in a general scholastic-ability test, his score in a special algebra-aptitude test, 
his average mark in high-school mathematics, and perhaps the number of 
hours per week that he devotes to studying algebra. Our prediction may be 
somewhat in error because of other factors that we have not accounted for, 
but our statistical methods will also tell us about how much margin of error 
to allow in our predictions. Thus not only can we make predictions but we 

6. They enable us to analyze some of the causal factors out of complex and
otherwise bewildering events. It is generally true in the social sciences, and in 
psychology and education in common with them, that any event or outcome 
is a resultant of numerous causal factors. The reasons why a man fails in his
business or in his profession, for example, are varied and many. Causal fac- 
tors are usually best uncovered and proved by means of experimental method. 
If it could be shown that, all other factors being held constant, certain businessmen
fail to the extent that they possess some defect of personality X,
then it is probable that X is a cause of failure in this type of business. Unfor- 
tunately for the social scientist, he cannot manage men and their affairs suffi- 
ciently to set up a good experiment of this type. The next best thing is to 
make a statistical study, taking businessmen as we find them, working under 
conditions as they normally do. The life-insurance expert does the same kind 
of thing when he follows the trail of all possible factors that influence the 
length of life and determines how important they are. On the basis of these 
statistical findings, he can predict about how long an individual of a certain 
type will probably live, and his insurance company can plan an insurance 
policy accordingly. Statistical methods are therefore often a necessary sub- 
stitute for experiments. Even where experiments are possible, the experi- 
mental data must ordinarily receive appropriate statistical treatment. 
Statistical methods are hence the constant companions of experiments. 
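
As a rough illustration of the fifth point above, prediction with a stated
margin of error can be sketched in a few lines of Python. The sketch rests on
several assumptions that are not from the text: it uses a single predictor (an
aptitude score) where the text mentions several, the scores are invented for
illustration, and fit_line and standard_error_of_estimate are hypothetical
helper names. The N - 2 divisor is one common convention for the standard
error of estimate and may differ from the formulas developed in later chapters.

```python
import math
from statistics import mean

def fit_line(x, y):
    """Least-squares slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    return slope, intercept

def standard_error_of_estimate(x, y, slope, intercept):
    """Typical size of the prediction errors (assumed N - 2 divisor)."""
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    return math.sqrt(sum(e * e for e in residuals) / (len(x) - 2))

# Invented data: aptitude scores and later algebra marks for seven freshmen.
aptitude = [40, 45, 50, 55, 60, 65, 70]
algebra_mark = [62, 60, 71, 70, 78, 83, 80]

slope, intercept = fit_line(aptitude, algebra_mark)
see = standard_error_of_estimate(aptitude, algebra_mark, slope, intercept)

# Predicted mark for a new freshman with an aptitude score of 58,
# together with the margin of error to allow around that prediction.
predicted = intercept + slope * 58
print(f"predicted mark: {predicted:.1f}, give or take about {see:.1f}")
```

The point of the sketch is only that the same computation yields both the
prediction and an estimate of how far such predictions tend to miss.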

What This Volume's Treatment of Statistics Will Include. For the next 
few paragraphs we shall take a hasty overview of the things to come. The
second chapter will give many more details of a general and preparatory 
nature. Here we shall try to look at the whole forest before we enter it. 

Descriptive and Sampling Statistics. It is common to make a broad dis- 
tinction between descriptive and sampling statistics. This distinction refers 
to two important uses of statistics. 

In the first place, statistics are used to describe situations. For example,
averages tell us “how much” of certain quantities we have in a group of individuals
or in a group of observations. An average (for example, arithmetic
mean, median, or mode) is a general-level concept. A single number tells how 
high one group, or sample, stands on a certain scale as compared with 
another. 

Other statistics tell us how much variability, or scatter, the individuals of a 
group show. A statistic known as the standard deviation has been the almost 
universal indicator of the amount of variability in a set of individuals or 
observations, though there are other indicators. 

A coefficient of correlation describes the closeness of relationship between 
two sets of measures of the same group of individuals or observations. Most
of science is concerned in finding out what things go with what, and what 
things are independent of what. Correlation methods, in the social sciences 
at least, are the most useful devices to answer these questions of interrelationships.
Averages, indices of dispersion and of correlation, are the basic and
chief descriptive statistics. 
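
The three kinds of descriptive statistics just named can be sketched briefly
in Python. This is an illustration only: the scores are invented, pearson_r is
a hypothetical helper written out by hand, and the standard deviation shown
uses N in the denominator, which may or may not match the convention adopted
in later chapters.

```python
import math
import statistics

# Invented scores for one small group of seven individuals.
scores = [12, 15, 15, 18, 20, 22, 24]
# A second invented measure taken on the same seven individuals.
marks = [50, 55, 58, 62, 70, 71, 80]

def pearson_r(x, y):
    """Product-moment coefficient of correlation for paired measures."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Averages: three ways of stating the general level of the group.
mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
mode_score = statistics.mode(scores)

# Variability: standard deviation (assumed here with N in the denominator).
sd_score = statistics.pstdev(scores)

# Relationship: closeness of correspondence between the two measures.
r = pearson_r(scores, marks)

print(mean_score, median_score, mode_score, round(sd_score, 2), round(r, 2))
```

Each value is a single number summarizing one aspect of the group: its
general level, its scatter, or the relationship between two of its measures.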

Sampling statistics tell us how well the statistics we obtain from measure- 
ments of single samples probably represent the larger populations from which 
the samples were drawn. Almost every statistic has a standard error. A 
standard error is an index number that leads us to conclusions concerning 
how far the statistic derived from the sample probably differs from the value 
we would obtain if we had measured an entire population. A population is a
well-defined group of individuals or of observations. For example, it could 
be one composed of Wistar-Institute albino rats between the ages of 30 and 60 
days. Or it could be all possible reproductions a certain observer could make 
of a line 10 cm. long under the same conditions of rest, time of day, and 
method of reproduction, for example, by drawing a line with a pencil. A 
sample in either case would be a limited number of observations out of the 
entire population. Arriving at conclusions that can be generalized to all 
members of a population depends upon reducing discrepancies between 
population values and sample values to as small size as possible. This is 
probably best illustrated by public-opinion polling, in which the margin of
error of voting outcome can be expressed in terms of a percentage of error. 
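
A minimal sketch of the idea of a standard error, for the mean of a sample,
is given below in Python. It assumes the common estimate s / sqrt(N), with s
computed using N - 1 in the denominator; the formulas given later in this
volume may be written differently. The data are invented (lengths in
millimeters of eight reproductions of a 10-cm. line), and
standard_error_of_mean is a hypothetical helper name.

```python
import math
import statistics

def standard_error_of_mean(sample):
    """Estimated standard error of the mean: s / sqrt(N)."""
    s = statistics.stdev(sample)  # sample standard deviation, N - 1 divisor
    return s / math.sqrt(len(sample))

# Invented sample: eight reproductions of a 100-mm line, in millimeters.
sample = [98, 102, 95, 101, 104, 100, 97, 103]

m = statistics.mean(sample)
se = standard_error_of_mean(sample)

# A rough band of two standard errors on either side of the sample mean,
# within which the population mean would usually be expected to lie.
band = (m - 2 * se, m + 2 * se)
print(f"mean = {m}, standard error = {se:.2f}, band = {band}")
```

The smaller the standard error, the more closely the sample mean can be
taken to represent the mean of the whole population of reproductions.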

In connection with sampling statistics, there is much in this volume on 
testing hypotheses. Scientific investigation proceeds from hypothesis to 
hypothesis. There are numerous hypotheses but relatively few established 
facts of a general nature. The sooner the research student realizes this point, 
the better for his clear thinking. There are some investigators, many of them 
well experienced, unfortunately, who do not make this distinction between a 
hypothesis and a fact; they mistake hypotheses for facts. For example, 
there is the hypothesis, stemming from Freudian psychology, that children 
suffering from asthma are of the “oral-dependent” type and that the breath- 
ing spasms are expressions of a cry for aid and for love. The plausibility of 
the idea, and its apparent consistency with other ideas, may be sufficient to 
lead many a clinical or psychiatric investigator to act as if the problem were 
solved, as if the idea were a fact. The properly skeptical investigator makes 
a study of a sample of asthmatic children and of their nonasthmatic siblings 
to see whether there is any greater incidence of dependency among the one 
group than among the other. Probably the most fruitful scientific investiga- 
tions, at least those that lead to dependable answers, or those that go beyond 
start by setting up a hypothesis, or several alternative 
hypotheses. Conditions are then arranged in such a way that if the results 
turn out one way, the hypothesis, or one of its alternatives, is supported and 
other hypotheses are rendered doubtful. The results must usually be cast in 
a statistical form which makes possible a decision between hypotheses. 

The simplest example of this is seen where we are studying the effects of 
one thing on another. Let us suppose that it is the effect of Benzedrine on 
ability to reason. We restrict our problem to two alternative and mutually
exclusive hypotheses: (1) that Benzedrine will affect thinking output or 
efficiency and (2) that it will not. The first hypothesis can be subdivided 
into two: that thinking will be facilitated and that thinking will be hindered.
The typical experimental operations would be somewhat as follows, briefly 
described. We develop or adapt a test of reasoning power. We select two 
groups of individuals of comparable age, education, and IQ, both of the same
sex. We determine that they are equal on a preliminary trial of the reasoning 
test. We administer the drug to one group and a control dose, or placebo, to 
the other. Neither group knows which has taken the drug. We administer 
another form of the reasoning test. We obtain two average scores, and there 
is some difference in a certain direction. The question is, does this obtained 
difference support hypothesis 1 or hypothesis 2? Could the difference have 
occurred by chance? If not, it must have been due to the drug, for so far 
as we know there is no other difference between the two groups that could 
account for it. It requires a test of the statistical significance of the difference
to permit us to reject one hypothesis and accept the other. Having rejected 
the idea that the difference was due to chance, we may accept the idea that it 
was due to the drug. Without the statistical test we would be rather helpless 
in reaching a dependable answer. 
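
The question "Could the difference have occurred by chance?" can be made
concrete with a short Python sketch. The technique shown is an exact
randomization (permutation) test, chosen here because it needs no tables; it
is not necessarily the test of significance developed later in this volume.
The reasoning scores are invented, and permutation_p_value is a hypothetical
helper name.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact randomization test for a difference between means.

    Every way of dividing the pooled scores into two groups of the original
    sizes is examined; the p value is the proportion of divisions whose mean
    difference is at least as large as the one actually obtained.
    """
    pooled = group_a + group_b
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    as_extreme = 0
    n_splits = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for rounding error
            as_extreme += 1
        n_splits += 1
    return as_extreme / n_splits

# Invented reasoning-test scores after the drug and after the placebo.
drug = [27, 30, 31, 33, 35, 36]
placebo = [22, 24, 25, 26, 28, 29]

p = permutation_p_value(drug, placebo)
print(f"proportion of chance divisions at least as extreme: {p:.4f}")
```

A small proportion means that chance alone would rarely produce so large a
difference, which is the ground for rejecting the hypothesis of no effect.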

The Normal Distribution Curve. Every student is familiar with the normal 
distribution curve; it is ubiquitous in psychological and educational literature. 
There has been much use and abuse of it, and many erroneous things are said
about it. The curve itself is a mathematical conception; it does not occur in
nature; it is not a biological or a psychological curve. It is an ideal pattern 
which we can apply to useful purpose in many a situation. The distinction 
between statistics and applied statistics (like that between mathematics and 
applied mathematics) must be kept in mind. Many fruitful applications of 
the normal distribution curve in psychology and education will be described 
in later chapters. These applications are usually made without proof that 
human variations are normally distributed but with the assumption that they 
are normally distributed in order that we may benefit from the use of the 
mathematical properties of the normal curve. If there were knowledge about 
distributions of human qualities to the contrary, we would, of course, forgo 
these applications. Familiarity with the normal curve and its properties is
therefore essential. 

Prediction and Statistics. Three chapters are organized under the heading
of "prediction." Most textbooks of beginning psychology start out by say- 
ing that it is the purpose of psychology to predict and control human behavior. 
From that point on, not much more is said about prediction. Dealing with 
the very complex and intricate set of phenomena that behavior of living 
organisms presents, and realizing the limitations to accurate predictions, it is 
appropriate for us to be modest on the subject. We should not feel guilty, 
however, about our failures to make predictions comparable with those in the
physical sciences to the extent that we repress candid and realistic efforts to 
achieve the predictions that are possible, nor should we disparage our accom- 
plishments in that direction. 

The operation called prediction is actually made even when we do not 
realize it. The vocational counselor who tells a client that he should consider 
seriously vocations P, Q, and R and should shy away from vocations V, U,
and W is tacitly predicting success in the one group and failure in the other. 
The clinician who diagnoses a person as having an anxiety neurosis is saying 
that he expects of this individual certain behavior. If he prescribes a certain 
program of therapy, he is predicting improvement under that treatment 
versus lack of improvement if it is not applied. The promotion of a child to
the next higher grade is a prediction that he will probably adjust better to 
that assignment than to reassignment to the same grade. Thus, almost all 
therapies and administrative decisions are, in effect, predictions, whether 
those who make those prescriptions would be willing to put themselves on 
record as making predictions or not. 

All predictions in psychology and education are what we often call actuarial.
That is, they are made on a statistical basis and with the knowledge that only
“in the long run” will the practice that each prediction stands for be better
than otherwise. Prediction of the single case is recognized as being involved 
with many chance elements. For the single case, the prediction is correct or 
it is incorrect, depending upon standards. In predicting in large numbers,
there are certain probabilities of being right and being wrong. The degree of 
rightness or wrongness can then be determined. Statistical methods provide 
the basis for choosing what prediction to make and also a basis for knowing 
what the odds are for being right or wrong. The various ways of making 
predictions and the ways of determining their degree of accuracy will be 
treated at great length in Chaps. 14 to 16.

Test Practice and Statistics. Because tests play such an important role in
psychology and education, considerable attention has been given to them in 
this volume. Recent thinking by statistical psychologists and educators has 
changed drastically our former understanding of tests as instruments of 
measurement. Many of the findings have been reflected in the chapters 
treating tests, particularly Chaps. 17 and 18. Certain ideas of reliability and 
validity of tests had become rather securely entrenched in the thought and 
practice of test users. These ideas are reexamined, and the newer experiences
have been used to advantage in the applications of statistics to test practice.

The Student's Aims in His Study of Statistics. With this overview of con- 
tent and with the preceding view of the needs and advantages of statistics, 
what should the student, particularly the beginner, aim to do about it? The 
beginner’s aims may be listed as follows, in order to make his task more 
specific.

1. To master the vocabulary of statistics. In order to read and understand a
foreign language, there is always the necessity of building up an adequate 
vocabulary. To the beginner, statistics should be regarded as a foreign 
language, which, he should resolve, will not for long remain entirely foreign. 
The vocabulary consists of concepts that are symbolized by words and by 
letter symbols that are substituted for them. Along with mathematics in 
general, statistics shares the ordinary symbols for numerical operations. 
Thus, much of the vocabulary is already known to the student. As for the 
new concepts, their meanings will continue to grow the more the student uses 
them. 

2. To acquire, or to revive, and to extend skill in computation. Although it 
was stated earlier that it is not an important aim for the student to become a 
statistical clerk, computation is important. For many people, the under- 
standing of the concepts themselves comes largely through applying them in 
computing operations. The mere step-by-step activities with numbers, when 
certain goals are in mind, provide opportunities for new insights to occur. 
The average investigator is never free from a certain amount of computation 
work to be done. Computation skill, and this includes application of formu- 
las as well as planning efficient operations, like any skill, grows with practice. 
If there is discouragement at first, further attempts should correct that. 

3. To learn to interpret statistical results correctly. Statistical results can be 
useful only to the extent that they are correctly interpreted. With full and 
proper interpretations extracted from data, statistical results are a most 
powerful source of meaning and significance. Inadequately interpreted, they 
may represent wasted effort. Erroneously accounted for, they are worse 
than useless. It is the latter eventuality that leads to the common sour- 
grapish remark, “Anything can be proved by statistics.” In the hands of 
skilled operators, statistics make data “talk.” It is therefore very important 
that the implications of any statistical result be realized and that their proper 
meaning be made manifest. The average reader is less able to interpret the 
result than the investigator should be. Upon the investigator's shoulders rests
the responsibility of telling the reader what the conclusions should be and of
including, also, some indication of the limitations of those conclusions.

4. To grasp the logic of statistics. Statistics provides a way of thinking as 
well as a vocabulary and a language. It is a logical system, like all mathe- 
matics, which is peculiarly adaptable to the handling of rational problems in 
science. This is hard to explain to the beginner. It is hoped that it may 
become more apparent as later chapters, particularly those dealing with 
sampling errors, hypotheses, predictions, and factor analysis, are encountered. 
The most efficient investigator is the one who masters the logical aspects of 
his research problem before he takes recourse to experiment or to field study. 
Proper formulation of a research problem is more than half the battle. Too 
many inexperienced investigators think of a question or a problem and rush 
to gather data before knowing what it is they really want to observe. Because
it is realized that data of some kind must be collected, much time and effort 
are wasted in collecting the data, without thinking through the problem and 
coming to the proper decision as to just what kind of data is needed. Or, data
are collected in such a manner that no statistical operations now known are 
adequate to treat the data so as to extract an answer. Well-planned investiga- 
tions always include in their design clear considerations of the specific statistical 
operations to be employed. 

5. To learn where to apply statistics and where not to. While all statistical 
devices have their power to illuminate data, each has its limitations. It is in 
this respect that the average student will probably suffer most from lack of 
mathematical background, whether he realizes it or not. Every statistic is 
developed as a purely mathematical idea. As such, it rests upon certain 
assumptions. If those assumptions are true of the particular data with 
which we have to deal, the statistic may be appropriately applied. The 
student should note wherever a new statistic is introduced that there are 
likely to be mentioned certain assumptions or properties of the situation in 
which that statistic may be utilized. Unfortunately, one can encounter 
masses of numbers that look as if they are candidates for the use of a certain 
statistic, for example, a biserial coefficient of correlation (see Chap. 13), when 
actually to apply the statistic would be meaningless if not misleading. The 
student without mathematical background will have to learn these exceptions 
by rote or be satisfied with common-sense reasons. He probably would 
prefer to avoid making ridiculous applications, and when in doubt he should 
seek advice or refrain from the doubtful application. 

6. To understand the underlying mathematics of statistics. This objective 
will not apply to all students. But it should apply to more than those with 
unusual previous mathematical training. Many an intelligent student who 
has not been introduced to analytical geometry or calculus can nevertheless 
grasp many of the mathematical relationships underlying statistics. This 
will give him more than common-sense understandings of what goes on in the 
use of formulas. For the student with mathematical background and for all 
others who wish to know more about the underlying basis of statistics encoun- 
tered in the following chapters, the best single source is to be found in the book 
by Peters and Van Voorhis.¹ We cannot take space to duplicate such proofs
in this volume. There are provided in the Appendix, however, a few mathe- 
matical derivations of formulas. The selection has been controlled by two 
considerations: (1) The only mathematics required to follow the proofs is that 
of ordinary algebra and basic calculus and (2) the proofs are not readily 
available elsewhere, either because they do not appear elsewhere or because 
the sources are scattered. 

1 Peters, C. C., and Van Voorhis, W. R. Statistical Procedures and Their Mathematical 
Bases. New York: McGraw-Hill, 1940. 
Some Suggested Aids in Learning Statistics


need of aids in reviewing those subjects, short of the employment of tutors. To such a
student it is strongly recommended that he consult H. M. Walker's Mathematics Essential
for Elementary Statistics, New York: Holt, 1934. This little volume provides an excellent
review, in the form of selected exercises, of the things that are most needed and in which
many students show forgetting. The book is especially recommended to the student who 
has forgotten his high-school algebra. 

Statistical Workbooks. For the first and second semesters’ courses in which this text is 
used, the student will find useful the two volumes by J. P. Guilford and W. B. Michael,
Elementary Statistical Exercises, New York: McGraw-Hill, 1956, and Intermediate Statistical
Exercises, by the same publisher. The first accompanies chaps. 2 through 8 and
part of 9. The second covers much of the remaining material of this volume. 

Computational Aids. The wise student will make as much use as possible of all available 
mechanical aids in the form of calculating machines, tables, and the like. There are 
available that will serve when three-place accuracy is sufficient, 
of a large part of one's computations. Barlow's Tables, New York:
Spon and Chamberlain, are admirable for supplying squares, square roots, and reciprocals
for numbers from 1 to 12,500. J. W. Dunlap and A. K. Kurtz have provided many charts,
tables, and formulas in their Handbook of Statistical Nomographs, Tables, and Formulas, 
Yonkers, N.Y.: World, 1932. Where great accuracy in numerical values based upon the 
normal curve is desired, the recommendation is the monograph by T. L. Kelley, The Kelley 
Statistical Tables, New York: Macmillan, 1938. 
CHAPTER 2
COUNTING AND MEASURING 


Two Kinds of Numerical Data. Numerical data generally fall into two 
major kinds. Things are counted and this yields frequencies, or things are 
measured and this yields metric values, or scale values. Data of the first kind
are often called enumeration data, and data of the second kind are called
measurements, or metric data.

Statistical procedures deal with both kinds of data, which is the reason for 
this chapter. There are certain fundamental ideas about numbers and their
use that it is well to have in mind before we go ahead. Perhaps it may seem 
strange to the reader, who has been counting and measuring as long as he can 
remember, that we should have to devote an entire chapter to these topics. 
The experts, who, we shall have to admit, have had a great deal more experi- 
ence with numbers and their use than most of us have had, never cease to 
report new ideas and insights as to the properties of the number system and 
as to its applications. It is well to keep in mind, incidentally, that there is a
real difference between the number system, as such, and its application to
counting and measuring. Much confused thinking has resulted from ignoring
this fact. The world does not necessarily owe its existence to number and
quantity. Numbers were invented by man as a symbolic system of internally 
consistent ideas which he can use effectively in describing the world as he 
knows it, thus gaining control over it. 

Data and Statistics. Before we go further, there are some frequently used 
terms that should be defined. These words are statistics and data. The
word statistics itself has several meanings. On the one hand it stands for a
branch of mathematics which specializes in enumeration data and their relation
to metric data. That is the meaning in the title of this book.

Another meaning, popular but not used by technical people, is implied in 
the mother's statement when she says, “Bobbie, stay out of the street, or you
will become a vital statistic.” Here the term in the singular refers to a fact
of classification, which is a chief source of all statistics. What the mother
meant is that Bobbie would change classification from the category “living”
to the category “dead.” The keepers of vital statistics in the department of
health and in other governmental agencies would have one less case among the
living and one more among the not-living. This use of the term “statistics”
is more common among those agencies that keep the records. The numerical
records are the statistics. While this use of the term is recognized by teachers 
and writers who specialize in statistics as a subject, their use of the term and 
the use of it in this book will usually mean something else. In the textbook 
and classroom situation, we are more inclined to use the word data in referring 
to details in the numerical records or reports. The fact that Bobbie is 
classified either among the living or the not-living is a datum. The word data 
always refers to more than one fact. 

In the textbook and classroom situation, too, the singular term statistic is
most likely to mean a derived numerical value such as an average, a coefficient
of correlation, or some other single descriptive concept. It may refer either
to the idea of an average, a median, a standard deviation, etc., or to a particular
value computed from a set of data. The reader can usually tell from the
context which usage of these terms is meant. 


DATA IN CATEGORIES


Probably most social data are in the form of categorical frequencies, the 
number of cases in defined classes or categories. The number of births, 
marriages, and deaths constitutes the bulk of the so-called vital statistics. 
The number of accidents, fatal or otherwise; the number of arrests for differ- 
ent reasons; and the number of new cases of poliomyelitis constitute other 
important information by which social agencies keep a finger on the pulse of 
human affairs. Political and economic interests also have their “barometers” 
for keeping informed of the trend of events, though some of these depend upon 
measurements of variables as well as upon counting cases. 

Classification. Before we count, in order to accumulate useful information,
we must know what it is we count. We do not count indiscriminately.
The frequency that we record refers to a particular class of objects, and this 
involves the process of classification. Classification of objects has been going 
on since Aristotle and even before Aristotle. It is a basic psychological 
process which can be seen in rudimentary form even in the simplest condi- 
tioned response. Wherever discriminations are made, along with generaliza- 
tions, classification of a sort occurs. Useful classifications for counting pur- 
poses, however, depend upon a high type of logical analysis. Much of 
science, following Aristotle, has been of the classificatory type. The classifi- 
cation of plant and animal life into species, genus, and order is the best exam- 
ple. Things thus become ordered and principles emerge. 

As science progresses, it is likely to abstract variables from its data, con- 
tinuous variations in single directions. This provides the way for more and 
more refined measurements. In spite of this general trend in a science, how- 
ever, the classification of phenomena will probably never cease to be useful. 
Besides, there are some absolute categories that seem not reducible to con- 
tinuous variables—life and death; married and unmarried; male and female; 
and voter and nonvoter. Such discrete classes must be recognized and are
usefully dealt with in research as well as in public affairs. Classification, 
then, is a very useful and necessary process in science as well as in practical 
life. It is the procedure by which objects become categorized for counting. 

Some Psychological Categories. Before specifying the way in which cate- 
gories should be set up and utilized, it may be well to have in mind some 
examples of the more common kinds from the field of psychology. In experi- 
mental psychology, particularly in psychophysical studies, we have categories 
of judgment. The second of a pair of stimuli is judged as “greater than,”
“equal to,” or “less than” the first. In public-opinion polling, responses are
obtained in a small number of categories that are intended to be meaningful
for interpretation purposes. In answer to the question, “Are you in favor of
the Marshall Plan?” the response might be “Yes,” “No,” “I do not know
what the Marshall Plan is,” or “I know what the plan is but I am undecided.” 
In taking a vocational-interest test the examinee may be required to respond 
in one of three categories, “L” (for like), “I” (for indifferent), or “D” (for 
dislike), concerning the thing proposed. In a problem-solving experiment 
with rats, after some preliminary observations, solutions might be categorized 
as falling into one of four types. Clinical types in psychopathology are 
categories mostly of long-standing recognition. And so one could continue. 
Many categories used in research are not static; they change as new light 
is thrown on the field of study. Some categories are invented for tempo- 
rary duty as provisional scaffolding upon which to arrange data for better 
inspection. 

There is not space here to give detailed instructions on how to choose or to 
construct useful categories.¹ It may suffice to say, and it may seem trite to
do so, that categories should be well defined, mutually exclusive (if possible), 
univocal, and exhaustive. The importance of good definitions cannot be over- 
estimated. Making proper assignment of cases to classes depends upon it. 
Being understood by one’s colleagues also depends upon it. A prime require- 
ment of scientific findings is that they shall be communicable to others. 
Other investigators should be able, if they so desire, to repeat our operations 
to test our results. The requirement of mutual exclusiveness is perhaps the 
most difficult to achieve. Lack of it probably means something is missing in 
defining the basis of classification. Lack of it means some overlapping, 
interdependence, and loss of power to draw clear-cut conclusions. A set of 
unique categories means that there is one and only one basis of classification. 
To group school children into three classes, boys, girls, and Mexicans, is to 
inject two principles or bases: sex difference and race difference. Perhaps 
anything as grossly absurd is easily avoided; it is the more subtle confusion of 
variables that causes trouble. By being exhaustive, a set of categories provides
a place for all cases. If there are only two classes, such as delinquents
and nondelinquents, and if they are well differentiated by objective criteria,
even two categories can be exhaustive. In many a system, particularly when
more than two classes are needed, there is often a necessity for one miscellaneous
group. This group is distinguished merely on the basis of failure to
place its members anywhere else. These cases are often ignored, but if they
are numerous it probably means biased sampling in other categories. It also
probably means lack of adequacy for the classificatory system as a whole.

1 For further details on this subject, see Peatman, J. G. Descriptive and Sampling Statistics.
New York: Harper, 1947. Chap. 2.
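
The requirements of mutual exclusiveness and exhaustiveness can be checked mechanically once the rules for the categories are written down. The following is a minimal sketch in Python, not a procedure from the text; the scores, the category limits, and the helper functions `classify` and `check_scheme` are all hypothetical and serve only as illustration.

```python
def classify(case, categories):
    """Return the names of every category whose rule accepts the case."""
    return [name for name, rule in categories.items() if rule(case)]


def check_scheme(cases, categories):
    """List the cases that fall in more than one category, and in none."""
    overlapping = [c for c in cases if len(classify(c, categories)) > 1]
    unplaced = [c for c in cases if len(classify(c, categories)) == 0]
    return overlapping, unplaced


# Hypothetical test scores and two hypothetical category schemes.
scores = [35, 40, 60, 75]

# A score of exactly 60 satisfies two rules: not mutually exclusive.
overlapping_scheme = {
    "low": lambda s: s < 40,
    "middle": lambda s: 40 <= s <= 60,
    "high": lambda s: s >= 60,
}

# Scores from 40 through 60 satisfy no rule: not exhaustive.
incomplete_scheme = {
    "low": lambda s: s < 40,
    "high": lambda s: s > 60,
}

print(check_scheme(scores, overlapping_scheme))  # ([60], [])
print(check_scheme(scores, incomplete_scheme))   # ([], [40, 60])
```

A check of this kind covers only the cases at hand; it cannot show that the definitions themselves are sound.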

Qualitative and Quantitative Categories. Most of the examples of categories 
given thus far have been what we call qualitative. The classes of objects are
different in kind. There is no reason for saying that one is greater or less,
higher or lower, better or worse than another. The basis is some qualitative
attribute. There may be some intrinsic or some external basis for thinking 
of the classes as being ordered on a scale of more or less, but, if so, we are 
unaware of it. There are, however, many classifications in which the groups 
can be ordered according to quantity or amount. It may be that the cases 
vary continuously along a continuum that we recognize but on which we cannot
yet make measurements for lack of an instrument; we can only group in a
gross manner. Ratings on a scale of five points (and even more) may well be
regarded as such a categorizing. In such situations, the categories cannot be 
defined, perhaps, in any independent terms. Each one may be distinguish- 
able merely by the fact that similar groups of cases are in it and these differ 
notably from members of other classes. 

Another instance is where the experimental controls are in graded steps. 
Five groups of subjects receive different amounts of instruction of a certain 
kind. In selection by means of tests, examinees are categorized into the 
accepted and the rejected groups. Later, after training or service on the job, 
there is a further classification between those who are satisfactory and those 
who are not. Experimental and technological practice is full of such exam- 
ples. Later chapters will explain methods for dealing with them. The next 
chapter will show how metric data are most conveniently handled by some- 
what arbitrary groupings in successive categories. 

Frequencies, Percentages, Proportions, Ratios. A frequency has already 
been defined as the number of objects in a category. There are some other 
related concepts that, though common in advanced arithmetic, most students 
do not appreciate fully. They play an important role throughout this vol- 
ume. We cannot review all the arithmetical features of these concepts here, 
but there are certain new uses of them that should be stressed and certain 
pitfalls to be pointed out. 

Let us consider an example to illustrate the use of percentages. In Table
2.1 are given some original data in the form of frequencies in 12 categories.
The categories are in a two-way classification, one qualitative and the other
quantitative. The data pertain to the number of students in training and
the number of these eliminated in each of four bombardier schools in the
Army Air Force during the early part of World War II. In each school the
students had been categorized in three levels as to aptitude. The categorization
by schools is qualitative and that by aptitude is quantitative. Such a
table would probably be set up to study the relation of elimination rate to
aptitude and also to differences between schools. We can make comparisons
both ways. There will be some comments, a little later, on how to prepare a
good table. Here we are interested in another point: the use of percentages.

TABLE 2.1. ELIMINATION RATES FOR BOMBARDIER STUDENTS OF THREE LEVELS OF
APTITUDE IN FOUR ARMY AIR FORCE TRAINING SCHOOLS*

                               Aptitude level
               -------------------------------------------------------
                         Low
               ------------------------
School         Number in    Number                Number       Per cent
               training     eliminated   . . .    eliminated   eliminated
A                  62           26                   160         28.4
B                  69           23                    84         17.9
C                  69           20                    78         13.7
D                 139           21                    49          8.7
All schools       339           90                   371         17.2

* Aptitude was measured in terms of a composite score on psychological tests. The data were
selected from results during the early months of World War II. (Adapted from unpublished data of
the AAF Training Command. This will be true of other AAF data used in this volume unless otherwise
specified.)

Percentage as a Rate Index. If we wanted to compare schools as to elimi- 
nations, the number eliminated in each school would be a poor index, particu- 
larly when our comparison is made at somewhat constant levels of aptitude. 
For example, at the low level of aptitude, the numbers of eliminations were 
not very different: 26, 23, 20, and 21. If we gave credence to such small 
differences, we should place the schools in the rank order A, B, D, and C,
from most to least eliminations. Schools A, B, and C had comparable num- 
bers in training, but school D had about twice as many. This makes us 
suspicious of the use of mere numbers eliminated as the way to compare 
schools. To put the schools on a fair basis we need to find an index of elimi- 
nation rate. We should ask what the elimination “scores” would have been 
if all schools had had equal numbers in training. If we assume that common 
number in training to be 100, the number eliminated per hundred is a familiar 
percentage. The percentages of eliminations for students of low aptitude are 
41.9, 33.3, 29.0, and 15.1. Twenty-six is 41.9 per cent of 62; 23 is 33.3 per 
cent of 69; and so on. Now we see that there are larger differences (this is
partly because three of the denominators, 62, 69, and 69, are less than 100)
between schools, and the rank order is now A, B, C, and D. The inversion of
the order of C and D is decisive; at least D’s position below C now seems 
decisive. The point of this illustration is that percentages are used to com- 
pare groups of objects on an equitable basis. Frequencies alone will not do 
when such comparisons are to be made. 
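
The computation just described can be sketched in a few lines of Python. This is only an illustration; the variable names are assumptions, and the frequencies are the low-aptitude figures quoted above from Table 2.1.

```python
# Low-aptitude students in each school: (number in training, number eliminated).
# Figures are those quoted in the text from Table 2.1; the names are illustrative.
low_aptitude = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}

# Percentage eliminated = eliminations per hundred students in training.
percent_eliminated = {
    school: round(100 * eliminated / in_training, 1)
    for school, (in_training, eliminated) in low_aptitude.items()
}

# Rank the schools from most to least, first by raw count, then by rate.
rank_by_count = sorted(low_aptitude, key=lambda s: low_aptitude[s][1], reverse=True)
rank_by_rate = sorted(percent_eliminated, key=percent_eliminated.get, reverse=True)

print(percent_eliminated)  # {'A': 41.9, 'B': 33.3, 'C': 29.0, 'D': 15.1}
print(rank_by_count)       # ['A', 'B', 'D', 'C']
print(rank_by_rate)        # ['A', 'B', 'C', 'D']
```

The two rank orders differ only in the positions of C and D, which is the inversion noted in the text.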

Some Limitations to the Use of Percentages. Some precautions should be 
pointed out concerning the use of percentages. Ideally, a percentage of any 
number less than about 100 should be computed with hesitation. If the 
number is less than 100, a change, by chance, of only one case added to or 
removed from a category would mean a change of more than 1 per cent. If
we ask what per cent 15 is of 25, the answer is 60. But if the frequency were 
to gain one, the percentage would be 64. If a lower limit must be mentioned 
as a total below which computation of percentages is unwise, it might be 
placed at 20. At this number, a change of one case would mean a correspond- 
ing change of 5 per cent. This is being quite liberal for the sake of applying a 
very useful index.
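
The arithmetic behind these precautions is simple: with a total of N cases, one case more or less shifts the percentage by 100/N. A brief Python sketch follows; the function names are illustrative assumptions, not terms used in this book.

```python
def percent(frequency, total):
    """Return the percentage that `frequency` is of `total`."""
    return 100 * frequency / total


def shift_per_case(total):
    """Change in the percentage produced by adding or removing one case."""
    return 100 / total


print(percent(15, 25))      # 60.0
print(percent(16, 25))      # 64.0
print(shift_per_case(20))   # 5.0
print(shift_per_case(100))  # 1.0
```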

In line with the discussion above, it would seem to be not very meaningful 
to report percentages to any decimal places unless the total number of cases 
exceeds 100. When we want a percentage for use in further computations, 
however, it would be wise to retain at least one decimal place. Frequencies 
are “exact” numbers (see p. 29), and percentages based upon them are 
accurate to as many decimal places as we wish to use. They thus describe 
the sample in terms of per hundred. It is when we become interested in 
letting an obtained percentage stand for a population value (see Chap. 9) 
that we must become conservative about reporting it. In Table 2.1 all 
percentages were reported to one decimal place because most of them were
based upon totals greater than 100 and all were made consistent. Con- 
sistency of this sort carries some weight but should not be pushed too far. 

When a percentage turns out to be less than 1.0 (for example, .2 per cent), 
it is not so meaningful as larger ones and, what is worse, it may be mistaken 
for a proportion (all proportions are less than, if not equal to, 1.0). In some 
social statistics a series of percentages may be this small. In this case it is
common practice to change the base from 100 to 1,000 or even more, for
example, to report 15 deaths per 100,000, 5 cases in 1,000, and the like. As
percentages these would read .015 and .5, respectively. To avoid confusion
with proportions, these should be written as 0.015 per cent and 0.5 per cent.
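
Changing the base is a matter of rescaling, as the following Python sketch suggests. The function names `per_cent` and `per_base` are hypothetical, chosen only for this illustration.

```python
def per_cent(cases, base):
    """Express 'cases per base' as cases per hundred."""
    return 100 * cases / base


def per_base(percentage, base):
    """Express a percentage as a rate on a larger base."""
    return percentage * base / 100


print(per_cent(5, 1_000))    # 0.5
print(per_base(0.5, 1_000))  # 5.0

# Writing the leading zero and the words "per cent" guards against
# the value being read as a proportion.
print(f"{per_cent(15, 100_000):.3f} per cent")
```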

Proportions. Whereas with percentages the common base is 100, with proportions
the base, or total, is 1.0. A proportion is a part, or fraction, of 1.0.
A proportion is 1/100 of a percentage, and a percentage is 100 times a proportion.
Careless individuals often call a percentage a proportion and vice
versa. By definition, and in all strictness, they are different concepts. The 
symbol used for percentage is capital P; for proportion the symbol is a lower-
case p. This should help to fix the idea of the relative sizes of the two. The
proportion of eliminees among low-aptitude students at school A was .419 
(see Table 2.1); for high-aptitude students at school B the proportion of 
eliminees was .080. 

As compared with percentages, proportions have some advantages as well 
as disadvantages. They are less familiar to nonmathematical individuals 
than are percentages. Whenever results are reported to the general reader, 
then, percentages are almost always to be preferred. Percentages have 
another advantage in that we can speak of percentage of gain or of loss. Pro- 
portions are always parts of something and can never exceed the total, which 
is 1.0. They have no place in expressing gain or loss, though presumably 
losses could be expressed in terms of proportions if we chose, for losses cannot 
exceed the total; but we never use a proportion for this purpose. 

The advantages of proportions are best seen in later chapters. They are 
used more than percentages, in connection with the normal distribution curve, 
in connection with item analysis of tests, with certain correlation methods, 
and so on. It has already been said that percentages may be mistaken for 
proportions when they are less than 1.0. Since proportions can never be 
greater than 1.0, they are much less likely to be mistaken for percentages. 

Probabilities. Another advantage of proportions is their relation to proba- 
bilities. Every probability can be expressed in the form of a proportion. 
We say that the probability of getting a head in tossing a coin is 1/2 or 1 
chance in 2. This is a more manageable figure if expressed as a probability 
of .5. We say that in throwing a die the probability of getting a six spot is
1 in 6. Expressed as a proportion this is .167. In general, for computation
purposes, decimal fractions are much preferred to common fractions; they 
are much more easily manipulated in addition and subtraction and in finding 
squares and square roots. The interchangeability of proportions and proba- 
bilities will be found to be a very common occurrence in the later chapters. 
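
The passage from common fractions to decimal proportions can be shown briefly in Python. This is a sketch only; the variable names are assumptions, and the standard `fractions` module is used merely to keep the common fractions exact before conversion.

```python
from fractions import Fraction

# Probabilities stated as common fractions in the text.
p_head = Fraction(1, 2)  # a head in one toss of a coin
p_six = Fraction(1, 6)   # a six spot in one throw of a die

# The same values as decimal proportions, to three places.
print(round(float(p_head), 3))  # 0.5
print(round(float(p_six), 3))   # 0.167
```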

Ratios. A ratio is a fraction. The ratio of a to b is the fraction a/b. A 
proportion is a special ratio, the ratio of a part to a total. We may also have
ratios of one part to another. For example, there were 69 low-aptitude 
students in training school B (Table 2.1), of whom 23 were eliminated and 46 
were graduated. The ratio of graduates to eliminees was 46/23, or 2 to 1. 
This ratio can also be expressed as 2.0. The ratio of eliminees to graduates 
was 23/46, or .5. This could also be expressed as .5 to 1 but ordinarily is 
not. At any rate, in a ratio the base is 1.0, as it is in a proportion. The 
chief difference is that a proportion is restricted to the ratio of part to total, 
whereas ratios are not. 
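
The distinction between a ratio of part to part and a proportion of part to total can be worked through with the figures for school B. The Python below is a minimal sketch with illustrative variable names.

```python
# Low-aptitude students at school B, as quoted from Table 2.1.
in_training = 69
eliminated = 23
graduated = in_training - eliminated  # 46

# Ratios of one part to another.
graduates_to_eliminees = graduated / eliminated  # 2.0
eliminees_to_graduates = eliminated / graduated  # 0.5

# A proportion is the special ratio of a part to the total.
proportion_eliminated = eliminated / in_training

print(graduates_to_eliminees, eliminees_to_graduates)
print(round(proportion_eliminated, 3))  # 0.333
```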

Ratios are useful as index numbers. They describe rates and relationships. 
The IQ is an index number of rate of general mental growth, the ratio of
mental age to chronological age (multiplied by 100). Comparisons of
incomes of regions are made in terms of per capita, the ratio of total income
to population. Costs of education are more meaningful if stated in terms of 
dollars per pupil per day attended rather than in terms of total sums of 
expenditures. In dealing with index numbers one should keep in mind the 
operations by which they were derived. It sometimes makes a difference 
when they are used in computation, as in averaging them or in correlation 
problems (see p. 71 and Chap. 13). 
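
One way in which the derivation of an index can matter is seen when ratios are averaged. The Python sketch below first computes the IQ ratio for a pair of hypothetical ages, and then compares the mean of the four school elimination rates with the rate for the four schools pooled. That this is the particular difficulty treated later in the book is an assumption; the function and variable names are illustrative.

```python
def iq(mental_age, chronological_age):
    """Ratio IQ: 100 times mental age over chronological age."""
    return 100 * mental_age / chronological_age


print(iq(12, 10))  # 120.0  (hypothetical ages, in years)

# Low-aptitude frequencies quoted from Table 2.1: (in training, eliminated).
low_aptitude = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}

# Averaging the four school rates treats every school alike ...
rates = [100 * e / n for n, e in low_aptitude.values()]
mean_of_rates = sum(rates) / len(rates)

# ... whereas the rate for the pooled group weights each school by its size.
total_trained = sum(n for n, _ in low_aptitude.values())
total_eliminated = sum(e for _, e in low_aptitude.values())
pooled_rate = 100 * total_eliminated / total_trained

print(round(mean_of_rates, 1))  # 29.8
print(round(pooled_rate, 1))    # 26.5
```

The two figures differ because school D, with the lowest rate, has about twice as many students as any other school.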

Tabulation of Data. Every student who writes a report based upon data 
is faced with the problem of how best to organize them in tables. Tables 
serve several purposes. There are tables that list the raw, or original, data. 
Lists of scores in several tests earned by different individuals provide an 
example. Although these may be very long in some reports, many readers 
like to see them presented in full so that they may apply checks or perform 
other operations than the investigator used. One common way to present 
these tables is in an appendix to the report. 

A second type of table is a summarizing device. It is used to present an 
organized and curtailed picture of what is in the original data. It includes 
such descriptive statistics as means, standard deviations, and the like, with 
the data grouped in one or more meaningful ways. Table 2.1 is an example 
of this type. All the essential information is there. Such a table should tell 
a complete story of its kind. It should be given a title that tells clearly what 
the table is about. If the title becomes too long it is better to relegate to a
footnote some of the secondary information. Headings of columns and rows 
should be descriptive, and their spacing and the lining should show clearly 
to what columns or rows they belong. A table should be so labeled that the 
reader need not turn to the text material in order to know what is there. 

How to Prepare Tables. The organization of such a table, in columns and 
rows, should take into consideration, first, what are the main points that 
should be brought out. In Table 2.1 probably the more important compari- 
son to be made is that of the different schools. A person concerned with the 
administrative aspects of bombardier training certainly would think so. 
One who is concerned with the development of aptitude tests would, of course, 
be interested in the other problem, the relation of elimination rate to aptitude 
level. In the latter case, a distinction between schools would be of little 
importance. Having decided which relationship is of most interest, the data 
should be arranged so that the comparisons one wants to make most are 
easiest to observe. Here let us say we want to compare schools. The best
basis of comparison is in terms of elimination rates. The four elimination 
rates are in an uninterrupted column. Comparison of elimination rates for 
different aptitude levels is more difficult because other numbers intervene. 

A second consideration, and it is of less importance, is the practical one of 
keeping the dimensions of the table consistent with the dimensions of a page. 
Columns can be longer than rows; consequently, considering the space available
for headings and the widths of numbers, we can fit the data in the available
space. With small tables this is no problem. Ordinarily, long lists go
better in columns and short lists in rows. Another consideration is the 
psychological fact that horizontal eye movements are easier and more natural 
for a reader than are vertical movements. All these considerations must be 
weighed and balanced against one another. 

A third type of table is a final, summarizing one. This brings together the 
salient findings from several tables. The second type may, of course, serve 
the same function; it all depends upon the scope and nature of the study. If
there is a final-type table, however, it serves as a basis for major conclusions
of the study.

Fig. 2.1. Percentage of bombardier students eliminated from training in four different
Army Air Force schools during the early part of World War II. Comparisons are made at
three different aptitude levels.
* Numbers like this represent totals in training in various groups.

Graphic Representation of Data. The graphic representation of data has 
become such an extensive art that it is possible to provide only an introduc- 
tion to the subject here. A few fundamental principles will be mentioned 
and illustrated. A “picture may be worth 10,000 words” but only if it is 
properly done. The first requirement is that it tell a complete story for what 
it is intended to convey. 

Bar Diagrams. Probably the most common type of figure for displaying 
frequencies or percentages for categories is the bar diagram. It is very
adaptable to many purposes and arrangements.
Figures 2.1 and 2.2 are designed to represent the data of Table 2.1. In
these examples, the bars are in the vertical position, but bars can also be
placed in the horizontal position (Figs. 2.3 and 2.4). In Fig. 2.1 the data are 
grouped so as to show best a comparison of the different schools. There are 
three groups of bars, one for each level of aptitude of students, and within 
each group every school is represented. In each case, the same kind of 
shading is used for the same school. The schools were arranged, in general, 
in their order of elimination rate. They should be in the same order in the 
three groups. This facilitates cross comparisons between aptitude levels and
gives an idea of trend within each group. Figure 2.2 was designed to emphasize
comparison of elimination rates as dependent upon aptitude level. There
are four groups, one for each school, with three bars in each group. Here
the quantitative nature of the aptitude variable determines the order of the
three bars in each group.

Fig. 2.2. Percentage of bombardier students eliminated from training at three different
levels of aptitude. Comparisons are made in four different bombardier schools of the
Army Air Force during the early part of World War II.

In both diagrams, note that the numbers of students in training are given 
at the tops of the bars. The statistically minded reader will want to know 
these values as a basis for judging about how reliable each percentage is and 
whether differences he sees in the bars are probably genuine or are perhaps 
due to chance. He cannot be sure about these questions unless he applies 
some procedures described in Chap. 9, but he can get a rough idea just by 
knowing the total numbers and by general past experience. 
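
The practice of labeling each bar with its percentage and with the number of cases on which it rests can be imitated even in plain text. The Python below is a rough sketch, not a reproduction of Fig. 2.1; it draws horizontal bars for the low-aptitude group only, and the function name `bar_line` and the one-mark-per-per-cent scale are assumptions.

```python
# Low-aptitude frequencies quoted from Table 2.1: (in training, eliminated).
low_aptitude = {"A": (62, 26), "B": (69, 23), "C": (69, 20), "D": (139, 21)}


def bar_line(school, in_training, eliminated):
    """One text bar: one '#' per percentage point (rounded), plus labels."""
    pct = 100 * eliminated / in_training
    return f"{school} | {'#' * round(pct)} {pct:.1f}% (N = {in_training})"


chart = [bar_line(s, n, e) for s, (n, e) in low_aptitude.items()]
print("\n".join(chart))
```

Each line carries the bar, the numerical percentage, and the total in training, so that the reader can judge how much weight a difference between bars will bear.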

Figures 2.3 and 2.4 show some data in response to the question “How many 
times did you feel afraid while flying on a combat mission?” applied to aircrew
personnel just returning from combat to redistribution stations in the
United States.¹ The categories of responses were “Every time, or almost
every time,” “About one-quarter to three-quarters of the times,” “One to

Fig. 2.3. Percentages of officer versus enlisted personnel in samples of Army Air Force
combat returnees who responded in specified ways to a question concerning fear in combat.

Fig. 2.4. Percentages of responses of each type given to the question, “How many times
did you feel afraid while flying on a combat mission?” by samples of officer and enlisted
personnel who had returned from tours of combat duty in the Army Air Force.

cal device. In Fig. 2.3 the bars are designed to compare officer with enlisted
aircrew personnel. For each category of response the bars for these two 
kinds of personnel are shown juxtaposed. The numerical percentage values 
are also written in so that the reader will have the more accurate information 
that numbers provide if he wants it. The sizes of samples are given below 
the diagram so that the reader may have some basis for degree of confidence 
in the differences represented. 

Figure 2.4 shows another arrangement of the same data. In this diagram 
we obtain a better conception of the proportions of reactions in each category 
for officers as a group and for enlisted men as a group, as well as some possi- 
bility of comparing the two in each category because the two bars are pre- 
sented parallel and the category percentages in the same rank order. 
Fig. 2.5. Descriptions of the status of new recruits to flying training in the Army Air Force
during the early part of World War II with respect to previous flying experience, marital 
condition, and type of training preferred. 


Pie Diagrams. Another kind of picture that is sometimes used to show 
proportions of a total is the pie diagram. The 360 degrees of a circle are sub- 
divided in proportion to the number or percentage in each category. Figure 
2.5 is an illustration. It shows the situation with regard to aviation cadets 
in the AAF with respect to three principles of classification: previous flying 
experience, marital status, and training preference. The number in the 
total sample is given below each diagram. The numerical percentage is 
written in each segment, which is shaded differently from others in the same
“pie.” The category name is also written in a segment if there is room; if
not, it is written just outside. 
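
The subdivision of the circle can be sketched in a few lines of Python. This is only an illustration: the function name `pie_angles` and the category counts are hypothetical and are not taken from Fig. 2.5.

```python
def pie_angles(counts):
    """Return the angle in degrees allotted to each category of a pie diagram.

    Each category receives a share of the 360 degrees proportional
    to its frequency in the total sample.
    """
    total = sum(counts.values())
    return {name: 360.0 * n / total for name, n in counts.items()}


# Hypothetical frequencies, chosen only to make the arithmetic easy to follow.
sample = {"Category A": 50, "Category B": 30, "Category C": 20}
angles = pie_angles(sample)

for name, degrees in angles.items():
    print(f"{name}: {degrees:.1f} degrees")
```

Whatever the frequencies, the angles must sum to 360 degrees, which serves as a quick check on the drawing.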

The pie diagram is restricted to this kind of display, the proportions of a 
total. It is inferior to the bar diagram, such as that in Fig. 2.4 (which also 
demonstrates proportions of wholes), when we want to compare the same 
categories in two samples. 

Trend Charts. When showing changes in frequencies, percentages, or pro-
portions over a period of time, a trend chart or belt graph is desirable. One
could show a bar for each sample and place the bars in time order, but this
would not picture changing conditions nearly as well as something continuous. 
Figure 2.6 is drawn to represent such changing conditions or trends in a cer- 
tain situation. The data are in terms of percentages of aviation students 
interviewed, who were subsequently recommended to different types of 
assignment. The data arose from the psychological unit at one classification 
center during World War II and cover a period of 15 months during the last 
part of 1942 and the first part of 1943. Observations were grouped by quar- 
ters, or three-month periods. The students interviewed were those whose 
Fig. 2.6. Trend in the percentages of interviewed aviation students in the Army Air Force
who were recommended for various assignments during a 15-month period of World War II. 
[Adapted from data in the AAF Aviation Psychology Research Program, Report No. 2, The 
Classification Program, P. H. DuBois, (ed.), p. 346.] 


classification on the basis of aptitude scores and expressed preferences for 
different types of training was not obvious under the prevailing regulations 
at the time. 

In some trend charts the frequencies are plotted—for example, those repre- 
senting population growths or those representing changes in income. In con- 
nection with the data of Fig. 2.6, we are not interested in numbers but, for 
administrative reasons, in proportions of students disposed of in each of four 
ways, for assignment to one of three types of training or to ground duty. 
The reasons for any trends are, of course, not obvious from the picture itself 
but, knowing the picture, a study of the situation would probably yield an 
explanation of the causes and suggest, if necessary, corrective measures. 
There are other trend charts of various kinds. In a broad sense, all curves
of learning and retention would be included. Their nature is so well known 
that they need not be described here. 

Pictographs. The layman, who is probably not interested in statistics or 
numbers, can be induced to read reports and to gain impressions the writer 
wishes to make, if the picture is dressed up in terms of concrete objects. 
Fig. 2.7. Percentage of pilot students at each aptitude level who graduated from primary
training in one sample of Army Air Force trainees during World War II. (From Aircrew
Selection and Training, a publication of the AAF Training Command Headquarters, 1944.)
Figure 2.7 is one example that was used to display to the average reader the
relationship that existed at the time between graduation rate and aptitude of
pilot students in the AAF. It requires a minimum of statistical sophistica-
tion to interpret such a picture, and the cartoonish quality of the drawings
attracts attention and interest. Very effective reporting of statistical results
to the general public is done in this manner. The number and variety of
ways in which this can be applied are limited only by the ingenuity of the 
reporter. 
MEASUREMENTS 


Some Examples of Psychological Measurements. In order to make our 
discussion concrete and specific, let us consider some typical examples of 
measurements commonly made by psychologists. Perhaps the first examples
that come to mind are scores on tests of mental ability. These are usually in 
terms of the number of correct responses to test items. A similar kind of 
measurement is seen in scores on a personality questionnaire or a vocational- 
interest inventory. In these cases it is not the number of “correct” responses 
but the number of responses indicating the same interest or trait, often 
weighted in proportion to their supposed diagnostic value. Also in the area 
of mental tests we find the frequent reference to “chronological age,” “men-
tal age,” and that ratio between the two, the “intelligence quotient.”

In the experimental laboratory as well as in the clinic, we frequently meas- 
ure in terms of the time required to complete a specified test or task. In 
memory experiments, we measure learning efficiency in terms of the number 
of trials to attain a certain standard of performance or in terms of the “good-
ness” of performance at the end of a certain trial or time. We measure
efficiency of retention in terms of the time required for relearning (overcoming 
the forgetting that has taken place) and the efficiency of recall in terms of 
association time or in terms of the number of items correctly recited. 

In the sphere of motivation, we gauge the strength of drive in terms of the 
amount of punishment (electric shock) an organism (for example, a rat) will 
endure in order to reach his immediate goal or in terms of the number of 
times he will take a constant punishment in order to attain the same result. 
The difficulty of a task or test item can now be specified in quantitative terms, 
as can the affective value (degree of liking or disliking) for a color, a sound, or 
a pictorial design. In studies of sensory and perceptual powers, the threshold 
stimulus and the differential limen are given in terms of stimulus magnitudes. 
The span of perception or of apprehension is given in terms of the average 
number of items that the observer can report correctly after momentary 
exposures. The galvanic skin response, the pupillary response, and the 
amount of salivation also serve as quantitative indicators of amounts of 
psychological happenings. 

Some Examples of Educational Measurement. Many an educational 
problem is also a psychological problem, and its mode of measurement has 
been indicated in the preceding paragraphs. Achievement in any area of 
learning, like any mental ability, is measurable in terms of test scores. 
Marks, however obtained, have been the traditional mode of evaluating 
students in specific units of formal education. Attendance records, data on 
size of classes, on budgets, on supplies, and on other material aspects of the 
well-regulated school system compose another list of measurements in educa- 
tion. Outcomes of educational effort are often expressed quantitatively in 
terms of promotion statistics, achievement ratios, and estimates of teaching 
success. Whether for purposes of research in education or for systematic 
and meaningful record keeping, statistical methods become indispensable 
tools. 

Some Different Kinds of Measurement. In a superficial way, it is easy to
see, as one glances over the list of psychological and educational measure- 
ments just mentioned, that there are different kinds of measurement involved. 
Among the psychologist's measurements, some are in terms of the stimulus— 
for example, the threshold stimulus or stimulus difference, the number of 
syllables or items, the amount of electric shock, etc. Others are in terms of 
the amount of response—for example, time of the response, number of 
responses or of correct responses, degree of the response, etc. Some measure- 
ments are more direct, such as reaction time, and others more indirect, such 
as affective value and difficulty. Some measurements are in terms of discrete 
units—number of individuals, syllables, words, items, crossings—and others 
are in terms of continuous scales—age, time of response, amount of punish- 
ment, and degree of effort. In the discrete type of measurement, things can 
increase or decrease only by changing one whole unit at a time, whereas in the 
continuous type the increase or decrease can be by as small a fraction of a unit 
as one pleases and can distinguish. Although this difference has a logical 
significance, in statistical practice, actually, we generally treat discrete and 
continuous measurements in the same manner. 

Rank Orders and Other Measurements. In a most general sense, we 
make a measurement whenever we assign numbers to things in such a way 
that those things are placed in order. Suppose that we place three boys, 
Charles, Bob, and David, in rank order for height, Charles being rank 3 
(tallest) and David, rank 1 (shortest). The numbers 3, 2, and 1, attached 
to Charles, Bob, and David, give us some useful information, such as the 
inference that Charles is taller than Bob and that Bob is taller than David.
These numbers do not tell us much more. Since they are merely ranks, we
cannot say that Charles is as much taller than Bob as Bob is taller than 
David. We cannot say that Bob is two times as tall as David or that Charles 
is three times as tall as David. Measurements in terms of rank order simply 
give us the serial arrangement of things. 

As we saw from the example just given, we are not at liberty to add and 
subtract or to multiply and divide such numbers. Had we actually applied 
a meter stick to these three boys and found that their heights were: Charles, 
195 cm., Bob, 180 cm., and David, 150 cm., matters would be different. Now 
we can make some further deductions about the heights of these boys. We 
can say that the difference between Bob and David is two times that between
Bob and Charles. Knowing that Charles is 15 cm. taller than Bob and that
Bob is 30 cm. taller than David, we can infer that Charles is 45 cm. taller than 
David. We can say that Bob is 20 per cent taller than David and that 
Charles is 30 per cent taller than David. It is apparent that we can 
now perform all the arithmetical operations of addition, subtraction, 
multiplication, and division with the three numbers assigned to the three 
boys. 
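
The contrast between ranks and measured heights can be verified with a short Python sketch. The heights and ranks are those given in the text; the variable names and the helper `percent_taller` are assumptions introduced only for illustration.

```python
# Ranks and measured heights of the three boys, as given in the text.
ranks = {"Charles": 3, "Bob": 2, "David": 1}
heights_cm = {"Charles": 195, "Bob": 180, "David": 150}

# Adjacent ranks always differ by 1, whatever the real differences are.
rank_gap_charles_bob = ranks["Charles"] - ranks["Bob"]
rank_gap_bob_david = ranks["Bob"] - ranks["David"]

# Measured heights reveal that the real gaps are unequal.
gap_charles_bob = heights_cm["Charles"] - heights_cm["Bob"]
gap_bob_david = heights_cm["Bob"] - heights_cm["David"]


def percent_taller(a, b):
    """Percentage by which boy a is taller than boy b (hypothetical helper)."""
    return 100 * (heights_cm[a] - heights_cm[b]) / heights_cm[b]


print(rank_gap_charles_bob, rank_gap_bob_david)  # equal rank gaps
print(gap_charles_bob, gap_bob_david)            # unequal height gaps in cm
print(percent_taller("Bob", "David"), percent_taller("Charles", "David"))
```

The ranks suggest equal steps, whereas the centimeter values show that one step is twice the other. Only the measured heights support statements about differences and ratios.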

Best Measurements Require an Equal Unit and an Absolute Zero. Some
measurements obtained in psychology and education are comparable with the 
measurements of height (linear distance) just mentioned, but most are not. 
Many measurements should be regarded as merely placing things in rank 
order until it is demonstrated that they give us more accurate information 
than that. We have something considerably better than rank order when 
our measuring scale possesses equal units. When this is true, a gain of a 
unit in one part of the scale is equal to a gain of a unit in any other part of the 
scale. We can then perform a number of different operations with numbers 
assigned to objects on such a scale that would otherwise be precluded. 

A measuring scale is not complete, however, unless it also has an absolute 
zero point. An example of a scale that has equal units but not an absolute 
zero point is the centigrade thermometer. The zero point is arbitrarily 
placed at the freezing point of water. With this instrument, we can say that 
the temperature of the weather changes as much when it rises from 0 to 25 
as it does when it rises from 25 to 50. But we cannot say that 50° is twice as
warm as 25° or that 100° is twice as hot as 50°. We can find differences
between numbers on this scale and get sensible answers, but we cannot 
multiply and divide. If we translate our zero mark to the absolute zero 
point (zero heat), which in terms of the common thermometer is —273°, then 
we can perform these operations. On the absolute scale, our 25° becomes 
298°, and our 50° becomes 323°. Now it is obvious that the higher of the
two (323) is not two times the lower (298). But if our absolute centigrade 
scale is correct, with regard to equality of units, we may well say that a 
temperature twice as hot physically as 298° is a temperature of 596° (also on 
the absolute scale). 
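
The translation of the zero point can be followed numerically. The sketch below uses the rounded value of -273° for absolute zero that the text uses; the function names are assumptions made for illustration.

```python
ABSOLUTE_ZERO_C = -273  # rounded value used in the text


def to_absolute(centigrade):
    """Translate a centigrade reading to the absolute scale."""
    return centigrade - ABSOLUTE_ZERO_C


def to_centigrade(absolute):
    """Translate an absolute-scale reading back to centigrade."""
    return absolute + ABSOLUTE_ZERO_C


low, high = to_absolute(25), to_absolute(50)
print(low, high)           # readings on the absolute scale
print(high / low)          # true ratio of the two temperatures, well below 2
print(to_centigrade(2 * low))  # centigrade reading that is twice as hot as 25 degrees
```

Differences are unchanged by the translation (50 - 25 equals 323 - 298), but ratios are meaningful only on the scale whose zero is absolute.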

Mental-test Scales as Metric Devices. What shall we say of a measuring 
scale of the type most frequently used in psychology and education—mental- 
test scores in terms of number of items correct? Have we here a scale with 
absolute zero and equal units? Strictly speaking, usually not. A score of
zero, no items correctly answered, does not mean zero ability. For had we 
included some easier items, even the lowest individual in the test could 
probably have made a score numerically greater than zero. Thus we are 
unable to say that a score of 50 points means twice the ability represented by 
a score of 25 or half the ability represented by a score of 100 points. For if 
our real zero-ability score should have been some 25 points below our arbi- 
trary one, these three scores would then become 50, 75, and 125. 

Now the second is not twice the first or half the third. Nor can we be sure
that our units are equal within the range of scores obtained. Unless the
units were equal, we should not be able to say that a score of 100 is as far
above one of 75 as the latter is above a score of 50. As a matter of long
experience, however, we find that test scores generally behave as if units 
were equal, as if one item correct adds an amount to the measurement of 
ability equal to that added by any other item correct. There are various
indications that tell the experienced worker in statistics when his measure- 
ments probably possess equal units and when they do not. And when they 
do, we can proceed to apply most of the ordinary statistical procedures. 
When we strongly suspect that they do not, we can make adjustments or 
substitute other statistical methods that do apply. The beginner in statisti- 
cal work need not be too much concerned about trying to decide the matter, 
but he should be aware that there are natural limitations to what one may do
in the way of statistics and that most of our ordinary conclusions are sound 
only in so far as equal units (and much less often an absolute zero point) 
prevail in the measuring scale. 

How Numbers Should Be Regarded in Measurement. Most measure- 
ments are taken to the nearest unit—nearest foot, inch, centimeter, or milli- 
meter, depending upon the fineness of the measuring instrument and the 
Fig. 2.8. An illustration of two metric scales, showing selected units and their limits.


accuracy we demand for the purposes at hand. In giving the height of a 
tree, measurement to the nearest foot—for example, 107 ft.—would be ade- 
quate. In giving the height of a girl, we should resort to inches or perhaps 
centimeters as our practical unit. In giving the length of a needle, we should 
probably report in terms of millimeters, and in giving its diameter as seen 
under a micrometer, we should resort to some smaller unit. In any case, we 
may notice that our object does not contain an exact number of our chosen 
units. Our tree is more than 107 ft. but is closer to 107 than it is to 108; our 
girl is not exactly 156 cm. but is closer to 156 than to 155; etc. The result is 
that our report of 107 for the tree means anything between 106.5 and 107.5 ft., 
and our report for the girl means anything between 155.5 and 156.5 cm. 
Figure 2.8 shows a graphic illustration of units and their limits. 

And so it is with most psychological and educational measurements. A 
test score of 48 is taken to mean from 47.5 to 48.5, and an obtained score of 
70 means from 69.5 to 70.5. We assume that a score is never a point on the 
scale but occupies an interval from a half unit below to a half unit above the 
given number. We can make this seem more reasonable by arguing that the 
person making a score of 48 actually might be just a fraction of a unit better 
than 47.5 at the moment, and being better than 47.5 is sufficient to give him a 
whole score of 48. Or our individual might just fail to be as good as 48.5 on 
the same test, but, not being quite good enough to achieve 49 items, he falls
back to 48. Although our tests are probably never so refined as to cause an 
individual to waver between fractions of a point (the margin of error is 
usually more than a whole point), this kind of argument rationalizes our 
procedure from one standpoint. 

A more important practical consideration dictates the taking of a score as 
occupying a whole interval on the scale, as the student will appreciate later. If 
we did not do this, an average computed from a set of ungrouped measure- 
ments would not be consistent with one computed when the same measure- 
ments are grouped. Even in dealing with discrete measurements, as, for 
example, the number of children in a family, we customarily proceed as if 8 
children meant anywhere from 7.5 to 8.5. The only notable exception to 
this general rule is in dealing with chronological age as given to the last birth- 
day and the like. Then a twelve-year-old child is anywhere from 12.0 to 13.0. 
If ages are given to the nearest birthday, however, our rule again applies, and a
twelve-year-old falls in the interval 11.5 to 12.5. 
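
The rule for exact limits can be stated compactly in code. The following is a sketch under stated assumptions: the function name `exact_limits` and its `last_birthday` flag are hypothetical, and the measurement is passed as a string so that the last reported decimal place is known.

```python
from decimal import Decimal


def exact_limits(reported, last_birthday=False):
    """Return the (lower, upper) exact limits of a reported measurement.

    The unit is taken to be the last decimal place written in `reported`.
    Ordinarily the limits lie half a unit on either side of the number.
    For ages given to the last birthday, the interval runs from the
    reported value up to one whole unit above it.
    """
    value = Decimal(reported)
    unit = Decimal(1).scaleb(value.as_tuple().exponent)
    if last_birthday:
        return value, value + unit
    half = unit / 2
    return value - half, value + half


print(exact_limits("107"))    # the tree, in feet
print(exact_limits("156"))    # the girl, in centimeters
print(exact_limits("48"))     # a test score
print(exact_limits("107.3"))  # a measurement to the nearest tenth
print(exact_limits("12", last_birthday=True))  # age to the last birthday
```

Strings and the `decimal` module are used so that limits such as 107.25 and 107.35 are represented exactly rather than as binary floating-point approximations.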


SOME RULES REGARDING NUMBERS


Approximate and Exact Numbers. Measurements, when taken to the 
nearest whole unit, are known as approximate numbers. They are always 
“fuzzy” and are of uncertain value within the unit where they fall. When 
we find a number by enumeration of discrete objects, we have an exact num- 
ber, for example, 15 men, 42 letters, or 50 pencils. The distinction between 
exact and approximate numbers we shall find important when they are used 
in calculations. Some rules about calculations are presented next. They 
would be unnecessary if all numbers in statistics were exact. 

How to Round Numbers. The beginner in statistical computation 
invariably asks, “How many decimal places shall I save?” In just this form,
the question cannot be answered. The question should read instead, “How
much accuracy have I in the answer?” A number may have been rounded,
dropping all digits to the right of the decimal point, yet not all the remaining 
figures may be accurate. Another number may have four places remaining 
to the right of the decimal point, yet all of them may be accurate. Some 
students may, if they lack good rules, drop too many figures, thus losing 
much of the accuracy that they really have; others may save a string of 
figures beyond the limit of accuracy, giving the appearance of great exactness 
that is really fictitious. 

First let us be clear as to the proper way to round a number. There is no 
particular difficulty in rounding to the nearest whole number; 15.7 becomes 
16, and 27.4 becomes 27; 9.6 becomes 10, and 0.96 becomes 1. In rounding 
to two decimal places, the same principles apply; 2.1827 becomes 2.18, and 
90.2179 becomes 90.22. It is when the first digit to be dropped is 5 that
difficulties arise. In rounding to two decimal places, again, the number 
7.1654 becomes 7.17, and even 7.16502 becomes 7.17 rather than 7.16, for the
reason that the decimal fraction beyond the 6 is greater than just .00500. 
Had the number been 7.16499, we should have rounded to 7.16, because it is a 
shade closer to 7.16 than to 7.17. 

When the number is 7.16500 (equidistant between 7.16 and 7.17) we follow 
an arbitrary rule that when the digit preceding the 5 is an even number we 
leave it as it is, but when this number is odd we raise it to the next digit. 
Thus 7.16500 would be rounded to 7.16, but 7.17500 is rounded to 7.18. The 
main reason for this is that when such numbers are summed, in a long series, 
we should have had by chance as many that were raised a half point as were 
lowered the same amount, and the changes will tend to compensate for one 
another. 
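
The odd-even rule is the same as the "round half to even" mode offered by Python's standard `decimal` module, so the examples above can be checked mechanically. The wrapper `round_even` is a hypothetical name used only for this sketch.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def round_even(number, places):
    """Round a numeral (given as a string) to `places` decimal places.

    Exact halves go to the neighbor whose last retained digit is even;
    every other case goes to the nearer neighbor.
    """
    quantum = Decimal(1).scaleb(-places)  # e.g. 0.01 for two places
    return Decimal(number).quantize(quantum, rounding=ROUND_HALF_EVEN)


for numeral in ["7.16502", "7.16499", "7.16500", "7.17500"]:
    print(numeral, "->", round_even(numeral, 2))
```

The numbers are passed as strings because a binary floating-point value such as 7.175 is not stored exactly, and the tie that the rule is meant to settle would then be lost before rounding begins.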

A word should be added about leaving a rounded number ending in the 
digit 5. For example, the number 6.21499 rounded to three decimal places 
becomes 6.215. Were we to round this further, following our rule, we should 
have 6.22. In view of the original number, this would be incorrect. It 
would have been well to indicate when the number 6.215 was given that the 5 
came by rounding upward or that the original number was less than 5 in the 
third decimal place. We can do this by writing it as 6.215- to show
this fact. The number 42.5+ has been rounded from something greater
than 42.50. Further rounding to a whole number gives 43, in spite of the 
odd-even rule offered above. 

How Many Significant Figures in a Number? When a measurement is 
given as 107 ft., the number is not only accurate to the nearest unit but is also 
said to be accurate to three significant figures. In spite of the fact that this 
measurement was taken only to the nearest foot, the 7 fixes the value between 
106.5 and 107.5, which makes the 7 significant. If we had, instead, a meas- 
urement of 107.3 ft., there would be accuracy to the nearest tenth of a foot 
and four significant figures. The .3 added to the number now fixes the meas- 
urement between 107.25 and 107.35 ft., tying the last place to the .3 ft. 

The number .00156 has just three significant figures or digits. They are 
the only ones that tell us about the numerical value, the two zeros being 
required merely to locate the position of the decimal point. The number 
15600, likewise, has only three significant digits, again the two zeros merely 
being used as “fillers” to locate the decimal point. If this were given as the 
approximate cost of a certain boat in dollars, we should conclude that the 
cost was anywhere from 15550 to 15650 dollars. But if it had been written as 
15600., with a decimal point after the last zero, this would indicate that meas- 
urement was to the nearest unit, or within the limits of 15599.5 to 15600.5 
dollars. 

When zeros come between other digits, they count as significant figures. 
Thus 1002.1 has five significant figures, and .071021 also has five. Any other 
zero not used to fix the decimal point is also usually significant, as in .420, 
which has three significant digits, since the last digit fixes the number
between .4195 and .4205. A lone zero before the decimal point, however, as 
0.41, is not significant, since it adds nothing to our information concerning 
numerical value. 
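
The conventions just described can be collected into a small counting routine. This is a sketch, not a general parser: the function name `significant_figures` is an assumption, and the numeral must be supplied as a string because the presence of a written decimal point (as in 15600.) changes the count.

```python
def significant_figures(numeral):
    """Count significant figures in a numeral written as a string.

    Conventions followed, as described in the text:
    - zeros before the first nonzero digit only locate the decimal point;
    - zeros between other digits are significant;
    - trailing zeros are fillers unless a decimal point is written.
    """
    digits = numeral.replace(",", "").lstrip("+-")
    if "." in digits:
        # With a decimal point, only leading zeros are dropped.
        return len(digits.replace(".", "").lstrip("0"))
    # Without a decimal point, trailing zeros are treated as fillers.
    return len(digits.strip("0"))


for numeral in [".00156", "15600", "15600.", "1002.1", ".071021", ".420", "0.41"]:
    print(numeral, significant_figures(numeral))
```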

Rules Governing Significant Figures in Computation. The following rules 
will determine how many significant figures there are in a number found by 
computation. 

1. In Sums of Numbers. Case I. When all the numbers added are 
regarded as accurate to the nearest unit, the sum is regarded as accurate to 
the nearest unit. 

Example: 47 + 161 + 5,171 = 5,379, a sum that is accurate to the nearest unit and that 
has four significant figures. 

A similar case occurs when all the numbers added have the same number 
of decimal places. 

Example: 2.91 + 40.22 + 0.07 = 43.20, where the answer is accurate to the second 
decimal place because all the numbers were accurate to that place. 

Case II. When numbers that are not accurate to the same number of 
places at the right of the decimal point are added, the sum is accurate only as 
far as the number having the smallest number of decimal places. 


Example: 17.257 + 142.1 + 75.47 = 234.8, which is rounded from 234.827. Note 
that the rounding was done after summing and not before. 


A similar rule is true when numbers rounded to the left of the decimal point
are summed. 

Example: 75,000 + 3,845 = 79,000, which is rounded from 78,845 because in the first
number there are only two significant digits to the left of the hundreds place. 
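
The rule for sums can be sketched as follows. The function name `add_approximate` is hypothetical, and one convention is assumed for this sketch only: a number rounded to the left of the decimal point is written in exponent form, so that 75,000 with two significant digits is entered as "75E3".

```python
from decimal import Decimal, ROUND_HALF_EVEN


def add_approximate(*numerals):
    """Sum approximate numbers and round the total to the coarsest place.

    The full sum is formed first; rounding is done only afterward,
    to the last place of the least precise number.
    """
    values = [Decimal(n) for n in numerals]
    coarsest = max(v.as_tuple().exponent for v in values)
    total = sum(values, Decimal(0))
    return total.quantize(Decimal(1).scaleb(coarsest), rounding=ROUND_HALF_EVEN)


print(add_approximate("2.91", "40.22", "0.07"))     # all to two places
print(add_approximate("17.257", "142.1", "75.47"))  # rounded from 234.827
print(int(add_approximate("75E3", "3845")))         # rounded from 78,845
```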

2. In Differences. Case I. If the two numbers are accurate to the same 
digit at the right, the difference is also accurate that far to the right. 

Example: 173.24 - 98.84 = 74.40, the zero being significant.

Frequently a difference is drastically reduced in the number of significant
figures, so much so that further computations with this difference are some-
times lacking in desired accuracy. This situation is to be avoided when
possible.

Example: 4.692 - 4.685 = 0.007.

Case II. As with addition, the answer is accurate no further to the right
than is the number whose accuracy extends less far to the right. In the fol-
lowing examples, the answers are rounded to as many significant figures as
are accurate.

Example: 175.1 - 82.715 = 92.4 (not 92.385).
Example: 5,200 - 829 = 4,400 (not 4,371).

In both these cases, contrary to the practice in summing numbers, the round-
ing can just as well be done before subtracting, for the result will be the same 
either way. 


3. In Products of Numbers. Case I. The product of two approximate 
numbers has no more accurate significant digits than has the number with 
the smaller number of significant digits. 


Example: 41.57 × 1.3 = 54 (not 54.041).


Case II. The product of an exact number times an approximate number 
has no more accurate significant figures than has the approximate number. 


Example: 24.091 × 22 = 530.00 (where 22 is an exact number).
Example: 24.09 × 72 = 1,734 (where 72 is an exact number).


Case III. The product of two exact numbers is accurate to all obtained 
digits. 

Example: 175 × 42 = 7,350 (which may be written as 7,350.).


4. In Quotients. Case I. The quotient of two approximate numbers has
no more accurate significant digits than the one having the smaller number of 
significant digits. 


Example: 7.182 ÷ 2.3 = 3.1 (not 3.12261).
Example: 4.07 ÷ 0.2815 = 14.5 (not 14.458).


Case II. The quotient from an exact and an approximate number con- 
tains no more accurate significant numbers than the approximate number. 


Example: 7.1025 ÷ 22 = 0.32284 (where 22 is an exact number).


Case III. The quotient of two exact numbers may be written to as many
significant figures as one wishes.
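
The rules for products and quotients lend themselves to the same kind of sketch. All names here (`count_figures`, `round_to_figures`, `approximate_result`, and the flag `b_is_exact`) are assumptions introduced for illustration. The case of two exact numbers is not handled, since no rounding is then required.

```python
from decimal import Decimal, ROUND_HALF_EVEN


def count_figures(numeral):
    """Count significant figures in a numeral written as a string."""
    digits = numeral.replace(",", "").lstrip("+-")
    if "." in digits:
        return len(digits.replace(".", "").lstrip("0"))
    return len(digits.strip("0"))  # trailing zeros are fillers here


def round_to_figures(value, figures):
    """Round a Decimal to the given number of significant figures."""
    exponent = value.adjusted() - figures + 1
    return value.quantize(Decimal(1).scaleb(exponent), rounding=ROUND_HALF_EVEN)


def approximate_result(a, b, divide=False, b_is_exact=False):
    """Multiply (or divide) a by b and keep only the accurate figures.

    If b is an exact number, the figures of a alone set the limit;
    otherwise the number with fewer significant figures sets it.
    """
    raw = Decimal(a) / Decimal(b) if divide else Decimal(a) * Decimal(b)
    if b_is_exact:
        figures = count_figures(a)
    else:
        figures = min(count_figures(a), count_figures(b))
    return round_to_figures(raw, figures)


print(approximate_result("41.57", "1.3"))
print(approximate_result("24.091", "22", b_is_exact=True))
print(approximate_result("7.182", "2.3", divide=True))
print(approximate_result("7.1025", "22", divide=True, b_is_exact=True))
```

As the text goes on to advise, such rounding is best applied only to a final answer, with extra figures carried through the intermediate steps.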


5. In Squaring a Number. Since this is a matter of multiplying a number 
by itself, the same rules as those governing products will apply. In general, 
the square of an approximate number contains no more accurate significant 
figures than the number itself. 

6. In Square Roots of Numbers. Case I. The square root of an approxi- 
mate number contains roughly the same number of significant figures as the 
number itself. The square root of 85.7, for example, may be taken appropri- 
ately to be 9.26, to three significant figures. 

Case II. The square root of an exact number may be given to as many 
places as one wishes.

Example: √5 = 2.2361. This could be carried further, or we could round it to 2.236
or to 2.24, depending upon our purposes. 

In many statistical problems which the student will encounter, the square
root of a number of persons or observations will be utilized (see Chap. 9
particularly). The number of discrete objects is an exact number; thus the
square root can be carried as far as one wishes. A good practice to follow is 
to think how many significant digits are needed for further computation. As 
a general suggestion, one might use not less than three significant digits in 
such a square root. 

Application of the Rules. Although the rules as just given are acceptable 
and sound, one should use them as guides and not follow them slavishly. One 
frequently has to use his best judgment and do the most reasonable thing. 
To follow the rules rigidly at every step of the way would sometimes introduce 
inaccuracies or else cause one to lose information that he really has and needs. 
One good general principle to follow is to carry along more significant figures 
through the successive steps of calculation than would be required for strict 
accuracy under the rules and withhold the rounding of numbers until the final 
answer is obtained, such as an arithmetic mean, a standard deviation, or a 
correlation coefficient. At the end of a solution, one may decide upon the 
extent of accuracy in the answer by applying the rules to every step in the 
series of numerical operations. This is difficult in some problems because of 
the many steps. There are also other things to be considered in particular 
situations, such as the standard error (see Chap. 9) of the statistic computed. 
For these reasons further suggestions will be offered more appropriately later 
when we are dealing with specific cases. 

The student will now see the reason for the earlier statement (p. 29) to the 
effect that the question “How many decimal places shall I save?” cannot be
answered very simply. The most important things to carry away from the 
discussion above are a better appreciation of the problems of accuracy 
and, roughly, some of the limitations to accuracy of figures derived from 
measurements. 


Exercises 


1. In a certain school in a southwestern city, the fifth grade had 80 pupils, of whom 
32 were of white, American-born stock, 20 were of Mexican, 12 of Japanese, and 16 of 
American-Indian stock. Complete the following table: 


Stock              Frequency   Percentage   Proportion

American White        32
Mexican                           25.0
Japanese                                       .15
American Indian


2. In the preceding data, what was the ratio of Mexicans to Indians? Of American 
white to Japanese? Of Indian to American white? 

3. In selecting a child at random from the fifth-grade group, what is the probability 
of getting a Mexican? Of getting a Japanese? An Indian? Either a Mexican or an 
Indian? 

4. In the fourth grade of the same school, the following numbers of children appeared:
American white, 47; Mexican, 27; Japanese, 11; and Indian, 15. In the third grade the 
numbers were: 66, 30, 6, and 18, respectively. Prepare a tabulation of the data in the 
three grades. Draw conclusions from the table. 

5. Draw bar diagrams representing the racial data given above.

6. Draw a trend chart representing the same data. 


7. State the exact limits to the following scores or measurements: 57 sec.; 150 kg.;
65 score points; 0 score points; 14.5 cm.; .125 sec.; 15 years (to the last
birthday).


8. Round the following numbers to one decimal place: 26.418 4.072 4.98 
9.092 120.052 0.3500 44.7508 291.6500 8.8502 31.15 48.25.
9. How many significant figures in each of the following numbers: 1,942 20,007 
170.9 0.31 28,000 21,500 0.3400 0.0017. 
10. Write the answers to the following problems to as many significant figures as the 
rules concerning accuracy allow: 
a. 2.14 in. times 15 (where 15 is an exact number). 
b. 5.2 + 17.2509 + 918.04. 
c. 242.8 X 0.075. 
d. 4.27505 divided by 25 (where 25 is an exact number).
e. 17.98 divided by 2.1.
f. 38.6 squared.
g. √50 (where 50 is an exact number, but be reasonable).
h. √25.32.


Answers 

1. Frequencies: 32; 20; 12; 16. 
Percentages: 40.0; 25.0; 15.0; 20.0.
Proportions: .40; .25; .15; .20. 

2. 5/4; 8/3; 1/2. 

3. 1/4; 3/20; 1/5; 9/20.

7. 56.5 to 57.5; 149.5 to 150.5; 64.5 to 65.5; -0.5 to +0.5; 14.45 to 14.55; .1245 to .1255;
15.0 to 15.9. 

8. 26.4; 4.1; 5.0; 9.1; 120.1; 0.4; 44.8; 291.6; 8.9; 31.1; 48.3. 

9. 4; 5; 4; 2; 2; 3; 4; 2.

10. (a) 32.1; (b) 940.5; (c) 18; (d) 0.171002; (e) 8.6; (f) 1,490; (g) 7.071; (h) 5.032.


CHAPTER 3 


FREQUENCY DISTRIBUTIONS 


After we obtain a set of measurements, the next customary step is to put 
them in systematic order by grouping them in classes. A set of individual 
measurements, taken as they come, as in the list in Table 3.1, does not convey 
much useful information to us. We have merely a vague, general conception 
of about how large they run numerically, but that is about all. The data in 
Table 3.1 are scores made by 50 students in an ink-blot test. Each score is 


TABLE 3.1. SCORES IN AN INK-BLOT TEST


the number of objects the student reported in observing 10 ink blots during a 
period of 10 min. Concerning such a set of data we usually want to know 
several things. One is what kind of score the average or typical student 
makes; another concerns the amount of variability there is in the group or 
how large the individual differences are; and a third is something about the 
shape of the distribution of scores, i.e., whether the students tend to bunch up 
at either end of the range or at the middle or whether they are about equally 
scattered over the entire range. The first steps in the direction of answering 
these questions require the setting up of a frequency distribution. 


THE CLASS INTERVAL: ITS LIMITS AND FREQUENCIES


The Size of Class Interval. We could begin by asking how many scores of 
25 there are, of 26, 27, etc., but this would not give us an adequate picture, 
because in a group of only 50 individuals whose scores range from 10 to 55, 
many scores do not occur at all and others occur only once. We therefore 
combine the scores into a relatively small number of class intervals, each class 
interval covering the same range of score units on the scale of measurement. 

The first thing to be decided is the size of the class interval. How many 
units shall it contain? This choice is dictated by two general customs to 
which experience has led us to agree. One is the rule that we should have not 



36 FUNDAMENTAL STATISTICS IN PSYCHOLOGY AND EDUCATION  [cu. 3 


less than 10 nor more than 20 class intervals. Though in rare instances we find 
workers going outside those limits, the general tendency is for them to keep 
within the boundaries of 10 to 15. The small number of groups is favored 
by the fact that we often deal with small numbers of individuals in our meas- 
ured sample and by the urge for convenience. The larger number is favored 
by the desire for accuracy of computation, because the process of grouping 
will introduce minor errors into the calculations, and the coarser the grouping,
i.e., the smaller the number of classes, the greater is this tendency.

Some Sizes Preferred. The second rule determining the choice of class interval 
is that certain ranges of units (scores) are preferred. They are 1, 2, 3, 5, 10, and 
20. These six intervals will be found to take care of almost all sets of data. 
To apply these rules to our data in Table 3.1, we need first to know the total 
range of scores from highest to lowest. The highest score is 55, and the lowest 
is 10, which gives us a total range of 46 points (one more than the highest
minus the lowest). An interval of 3 points is the one that would give us the
best number of classes that our first rule requires. It will be found that the
range divided by the number of units in the class interval (in this case 46
divided by 3), with any fraction counted as a whole interval, ordinarily gives
the total number of class intervals needed to cover the range. In this instance,
we should therefore have 16 groups. If
we chose 5 units as our class interval, we should have 46/5, which is 10 groups.
In view of the relatively small number of cases, and because an interval of 5 
will give us the minimum of 10 groups, we choose 5 as our class interval. 
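
The arithmetic of the two rules can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names are hypothetical, and a fraction is rounded up on the view that a partly filled interval still counts as a class. The exact count may also shift by one depending on where the intervals are started.

```python
import math

# The six preferred interval sizes named in the text.
PREFERRED_SIZES = (1, 2, 3, 5, 10, 20)


def total_range(lowest, highest):
    """Inclusive range of integer scores: highest minus lowest, plus one."""
    return highest - lowest + 1


def number_of_intervals(lowest, highest, size):
    """Class intervals needed to cover the range, counting a fraction as a whole."""
    return math.ceil(total_range(lowest, highest) / size)


# Ink-blot data: lowest score 10, highest score 55.
for size in PREFERRED_SIZES:
    print(size, number_of_intervals(10, 55, size))
```

Only the sizes 3 and 5 give a number of groups inside the limits of 10 to 20 for these data, which is why the choice in the text lies between those two.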

Where to Start the Class Intervals. It would be a quite natural tendency
to start the intervals with their lowest scores at multiples of the size of the 
interval; when the interval is 3, to start them with 9, 12, 15, 18, etc.; when the 
interval is 5, to start with 10, 15, 20, 25, 30, etc. This is by far the most 
common practice, though it is admittedly arbitrary. When the size of the 
interval is 3 or 5, there are arguments for starting intervals in such a way that 
the multiple of the size of interval is in exactly the middle of the group. Thus 
the grouping by three’s would give groups like 8, 9, 10, and 14, 15, 16, etc.; 
by five’s, it would be 8, 9, 10, 11, 12 and 18, 19, 20, 21, 22, etc. The midpoints 
would be multiples of 3 in the one case and of 5 in the second case. We use 
score limits so much more than we do midpoints, however, that the arguments 
seem mostly to favor beginning intervals consistently with the multiples of the 
size of interval, even when the size is three or five units. 

Score Limits of Class Intervals. We shall follow the usual practice here, 
placing in the lowest interval all scores of 10, 11, 12, 13, and 14; in the next 
higher interval, scores of 15, 16, 17, 18, and 19; etc. (see Table 3.2). Instead 
of writing out all the scores for each interval, we give only the bottom and top 
scores. Our intervals are then labeled 10 to 14, 15 to 19, 20 to 24, etc., or, 

1 While the rules as just stated will be satisfactory for most purposes, some variations
will be presented later in connection with grouping for graphic representation of distributions
and for estimating a mode (see Chap. 4).

more often, 10-14, 15-19, 20-24. The bottom and top scores for each interval
represent what we call the score limits of the interval. They do not indicate
exactly where each interval begins and ends on the scale of measurement.
The score limits are useful primarily in tallying and in labeling the intervals, 


TABLE 3.2. FREQUENCY DISTRIBUTION OF THE INK-BLOT SCORES THAT WERE
LISTED IN TABLE 3.1


Σf = 50 = N


Exact Limits of Class Intervals. We shall soon find that in computations 
we must think in terms of exact limits. Remember that a score of 10 actually 
means from 9.5 to 10.5, and that a score of 14 actually means from 13.5 to 
14.5. This means that the interval containing scores 10 to 14 inclusive 
actually extends from 9.5 to 14.5 on the measurement scale. Likewise, the 
interval having score limits of 15 and 19 has exact limits of 14.5 and 19.5 on 
the scale. The interval labeled 55 to 59 actually extends from 54.5 to 59.5. 
The same principle holds no matter what the size of interval or where it 
begins. An interval labeled 14 to 16 includes scores 14, 15, and 16 and 
extends exactly from 13.5 to 16.5. An interval labeled 70 to 79 extends from 
69.5 to 79.5. It will be seen that by following this principle each interval 
begins exactly where the one below leaves off, which is as it should be (see 
Fig. 3.1). 
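
The rule for exact limits may be sketched as follows. This is a minimal illustration, assuming measurements recorded to the nearest whole unit; the function name and the `unit` parameter are our own and do not come from the text.

```python
def exact_limits(low_score, high_score, unit=1):
    """Exact limits of an interval given its bottom and top score limits.

    Each recorded score stands for half a unit on either side of it,
    so the interval reaches half a unit below its bottom score and
    half a unit above its top score.
    """
    half = unit / 2
    return (low_score - half, high_score + half)


print(exact_limits(10, 14))
print(exact_limits(15, 19))
print(exact_limits(70, 79))
```

Because the upper exact limit of one interval equals the lower exact limit of the next, the intervals join without gaps or overlap.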

Tallying the Frequencies. Having decided upon the size of class interval 
and with what scores to start the intervals, we are ready to list them, as in 
Table 3.2. It is accepted custom to place the highest measurements at the 
top of the list and the lowest at the bottom, as shown here. Space is left in 
the second column for the tallying process. Taking each score in Table 3.1


1 Strictly speaking, limits such as 69.5 and 79.5 also stand for very small distances rather
than points. Only in a relative sense are they division points between intervals. Some
writers define an interval such as the one containing scores from 70 to 79 as being actually
from 69.5000 to 79.4999. One could extend the zeros and nines indefinitely. For practical
purposes, the “exact” limits of 69.5 and 79.5 will serve very well when measurements
are integers.

as we come to it, we locate it within its proper interval and write a tally mark
in the row for that interval. Having completed the tallying, we count up 
the number of tally marks in each row to find the frequency (f), or total num- 
ber of individuals falling within each group. The frequencies are listed in 
the third column of Table 3.2. 

Checking the Tallying. Next we sum the frequencies, and if our tallying 
has omitted none and duplicated none, the sum should equal the number of 
individuals. At the bottom of the column we find the symbol Σf, in which Σ
(capital Greek sigma) stands for “the sum of” whatever follows it. Thus,
Σf is “the sum of the frequencies.” The total number of individuals or
measurements in our sample is symbolized by the capital letter N, which


FIG. 3.1. Exact limits of class intervals with different sizes of interval and of unit of measurement.

stands for “number.” If Σf does not equal N, there has been a mistake in
tallying, and tallying should be repeated until this check is satisfied. Even
if Σf does equal N, there could have been a tally or two placed in the wrong
interval. There is no way of checking this kind of error except by doing the 
tallying twice. The moral is that great care should be taken to make the 
finding of frequencies correct at the first attempt. 
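
The tallying and its check can be imitated in a short Python sketch. The ten scores below are hypothetical, chosen only for illustration (they are not the scores of Table 3.1), and the function name is our own.

```python
from collections import Counter


def tally(scores, start, size):
    """Count scores per class interval; keys are (bottom, top) score limits."""
    counts = Counter()
    for score in scores:
        # Bottom score limit of the interval that contains this score.
        bottom = start + ((score - start) // size) * size
        counts[(bottom, bottom + size - 1)] += 1
    return counts


# Hypothetical scores, for illustration only.
scores = [12, 17, 18, 23, 27, 27, 28, 31, 36, 44]
freq = tally(scores, start=10, size=5)

# List the intervals with the highest at the top, as is customary.
for limits in sorted(freq, reverse=True):
    print(f"{limits[0]}-{limits[1]}  f = {freq[limits]}")

# The check described above: the sum of the frequencies must equal N.
sum_f = sum(freq.values())
n = len(scores)
print("sum of f =", sum_f, " N =", n)
```

As the text warns, agreement between the sum of the frequencies and N shows only that no score was omitted or duplicated; it cannot reveal a tally placed in the wrong interval.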


GRAPHIC REPRESENTATION OF FREQUENCY DISTRIBUTIONS 


The frequency distribution in Table 3.2, particularly the array of tally 
marks, gives us a general picture of the group of individuals as a whole. We
can see, for example, that the most frequent scores fell in the interval 25-29, 
that the very low and very high scores are more rare, and that the greatest 
bunching of scores comes in the lower half of the range. Much better pic- 
tures of this distribution are afforded in Figs. 3.2 and 3.3, however, where the 
general contour of the distribution is more accurately represented and the 
numbers of cases in the various intervals are more exactly shown. Figure 3.2
is of the type known as frequency polygon, and Fig. 3.3 is of the type called 
histogram, or sometimes, though less often, column diagram.

The Frequency Polygon and How to Plot It. A polygon is a many-sided
figure, and thus the picture in Fig. 3.2 derives its name. There are a number 
of factors to be kept in mind in drawing such a figure. 

The Kind of Graph Paper. First, it might be said that, in general, the most 
convenient type of cross-section paper is the type that is ruled into heavy

FIG. 3.2. A frequency polygon for the distribution of scores in the ink-blot test.


lines 1 in. apart each way, subdivided into tenths of an inch more lightly 
drawn. 

The Width of the Diagram. Second, the question of the height and width 
of the entire figure arises. For the sake of easy readability, the width of the 
figure should be at least 5 in. We have altogether 10 class intervals in which

FIG. 3.3. A histogram for the same distribution as in Fig. 3.2.


there are frequencies, but, in drawing the diagram, we should allow for one 
more class interval at each end of the scale, making 12 in all. This is to permit
bringing the ends of the polygon down to the base line (see Fig. 3.2).
Labeling the Base Line. In deciding how many intervals to allow to the 
inch, it is well to remember that we are going to label the base line of the 
figure in terms of our measuring scale and hence should plan things so that
1/10 in. will stand for an integral number of units on this original scale. In
the ink-blot data, we have been dealing with a class interval of five units, and
we are making room for 12 intervals on our base line, in other words, for 60
units. By allowing 1/10 in. to each unit (1/2 in. to each class interval), our
distribution will spread over an extent of 6 in., which is sufficiently large. On
the base line, therefore, we label every fifth line with a multiple of 5, beginning
with 5 at the left and ending with 65 at the right.

The Height of the Figure. The third important question is with regard to 
the relative height of the figure. For the sake of appearance and also for easy 
reading of the diagram, there is a general custom of making the maximum 
height of the distribution from 60 to 75 per cent of the total width. Our
total width is 6 in., or 60/10 in. Sixty per cent of this would be 36/10 in., and
75 per cent would be 45/10 in. Our highest frequency, as we see in Table 3.2,
is 12. By allowing 3/10 in. to the person, the height of 36/10 in. would be
attained, and by allowing 4/10 in. to a person a height of 48/10 in. would be
reached. The former comes within our rule, and the latter does not; therefore
we adopt 3/10 in. as the unit on the vertical scale.
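
The choice of vertical unit can be checked mechanically. The sketch below is our own illustration, not a prescribed method: it works in tenths of an inch (the ruling of the graph paper) so that all arithmetic stays in whole numbers, and the function name is hypothetical.

```python
def acceptable_units(width_tenths, max_frequency):
    """Vertical units (tenths of an inch per case) that bring the maximum
    height of the figure to between 60 and 75 per cent of its width."""
    result = []
    for unit in range(1, width_tenths + 1):
        height = unit * max_frequency  # maximum height, in tenths of an inch
        # Integer form of 0.60 * width <= height <= 0.75 * width.
        if 60 * width_tenths <= 100 * height <= 75 * width_tenths:
            result.append(unit)
    return result


# Width of 6 in. (60 tenths) and a highest frequency of 12.
print(acceptable_units(60, 12))
```

For the ink-blot data only 3/10 in. per case satisfies the custom: 2/10 in. gives a figure too flat, and 4/10 in. gives one too tall.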

How to Locate a Midpoint. In order to plot a dot to represent the frequency 
in each class interval, we must next decide above what point on the base line 
the dot shall be. It is plotted exactly at the midpoint of the interval, and 
the midpoint is exactly midway between the exact lower and upper limits of 
the interval. A simple rule to find the midpoint is to average either exact or
score limits of the interval. The interval containing scores 10 to 14 inclusive
has exact limits of 9.5 and 14.5. The entire range is 5 units. Half this
range is 2.5 units. Go this far above the lower limit, and you have 9.5 plus
2.5, or 12 exactly, as the midpoint. This could be written as 12.0. Or 
deduct 2.5 from the upper limit, 14.5 minus 2.5, and you also have exactly 
12.0 as the midpoint; or the average of 10 and 14 is 12.0. The midpoint of 
the interval 55-59 is 57.0. When the class interval is 5 and the lowest score 
in each interval is a multiple of 5, as will be true in many of the instances met 
in psychology and education, the midpoints will end in 2 and 7 systematically. 
For the sake of a complete picture of the midpoints for the data in Table 3.2,
we have given in Table 3.3 the full set of midpoints. For a general illustration
of midpoints, see Fig. 3.4.
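
That the several routes to the midpoint agree can be verified with a small sketch. The function names are our own, and integer scores are assumed.

```python
def midpoint_from_score_limits(low_score, high_score):
    """Average of the bottom and top scores of the interval."""
    return (low_score + high_score) / 2


def midpoint_from_exact_limits(lower_exact, upper_exact):
    """Lower exact limit plus half the width of the interval."""
    return lower_exact + (upper_exact - lower_exact) / 2


# The interval 10-14, whose exact limits are 9.5 and 14.5.
print(midpoint_from_score_limits(10, 14))
print(midpoint_from_exact_limits(9.5, 14.5))

# Midpoints for intervals of 5 units that begin at multiples of 5.
for bottom in range(5, 65, 5):
    print(f"{bottom}-{bottom + 4}", midpoint_from_score_limits(bottom, bottom + 4))
```

The loop reproduces the pattern noted above: with intervals of 5 starting at multiples of 5, every midpoint ends in 2 or 7.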

Plotting the Points. Having determined the midpoints and knowing the 
frequencies corresponding to them, we are ready to plot the dots for the fre- 
quency polygon. For the two intervals at the ends of the distribution (see 
Table 3.3) we have frequencies of zero. Sometimes there are frequencies of 
zero not in the last two classes. When so, we plot these dots also on the base 
line and bring the lines that connect the dots down to the base line at those 
places. That did not happen to be the case in these data. When the dots 
are placed at the midpoints, as directed, it may be noted that they do not 
appear directly above the midpoints of the marked places on the base line

FIG. 3.4. Midpoints of class intervals with differing numbers of units. (Panels: an interval of 2 units, scores 42-43, midpoint 42.5; an interval of 5 units, scores 45-49, midpoint 47.0; an interval of 10 units, scores 40-49, midpoint 44.5.)


(5, 10, 15, 20, etc., in this case). Remember that these multiples of 5 are not 
the exact limits of the class intervals; they are merely convenient and mean- 
ingful reference points on our original scale. Had we begun the class inter- 
vals at scores other than multiples of 5—for example, at 11, 16, 21, 26, etc.— 
we should still plot at the mid-points of the intervals (now different than 


TABLE 3.3. CLASS INTERVALS AND THEIR MIDPOINTS

Score limits | Exact limits | Midpoints | Frequencies
60-64          59.5-64.5      62          0
55-59          54.5-59.5      57          1
50-54          49.5-54.5      52          1
45-49          44.5-49.5      47          3
40-44          39.5-44.5      42          4
35-39          34.5-39.5      37          6
30-34          29.5-34.5      32          7
25-29          24.5-29.5      27          12
20-24          19.5-24.5      22          6
15-19          14.5-19.5      17          8
10-14          9.5-14.5       12          2
5-9            4.5-9.5        7           0

before) and should still label the reference points as multiples of 5, as in 
Fig. 3.2. The curve as drawn truly represents the shape of the distribution 
as we have grouped the scores. 

The Histogram and How to Plot It. Many of the facts learned in plotting 
the frequency polygon also apply in plotting the histogram. The choice of
size, proportions, units per square of graph paper all are the same. The
only important difference is that, although we locate the height of each 
column or rectangle by placing a dot at the midpoint of each interval, we do 
not then connect dot to dot with straight diagonal lines. Instead, we draw
a short horizontal line through each dot (see Fig. 3.3), extending it to the 
upper and lower exact limits of each class interval. Those exact limits are 
given in Table 3.3 for our data. Having done this, we erect vertical lines at 
each of these exact limits tall enough to form complete rectangles. Again 
it may be noticed that the rectangles seem to be misplaced a half unit with 
respect to the numbers on the base line, but this is correct; the choice of 
limits for our classes makes the exact limits come a half unit below the multi- 
ples of 5, i.e., at 4.5, 9.5, 14.5, 19.5, etc. 

Advantages and Disadvantages of the Two Types of Figure. On the 
whole, the frequency polygon seems generally preferred to the histogram. 
For one thing, it gives a much better conception of the contour of the distribu- 
tion; the transition from one interval to another is direct and probably 
describes the distribution more accurately. The histogram gives a stepwise 
change from interval to interval, based upon the assumption that the cases 
falling within each interval are evenly distributed over the interval. The 
polygon gives the more correct impression that, on both sides of the highest 
point (directly above the mode), the cases within an interval are more fre- 
quent on the side nearer the mode, except where there are inversions in the 
general trend (as between scores of 15 and 25 in Fig. 3.2). 

On the other hand, the histogram gives a more readily grasped representa- 
tion of the number of cases within each class interval; each measurement or 
individual occupies exactly the same amount of area. One more advantage
favoring the polygon is that when we wish to plot two distributions over- 
lapping on the same base line, as, for example, two different age groups or the 
two sexes, the histogram type gives a very confused picture, whereas the 
polygon type usually provides a clear comparison. 

Plotting Two or More Distributions When N Differs. The comparison of 
two distributions graphically raises a new question when the numbers of 
individuals in the two groups differ. With large differences, naturally, there 
is the question of scale, or how much space to give the figure. If the smaller 
distribution is large enough to be clearly legible, the larger one may extend 
beyond reasonable bounds. Furthermore, if it is general shapes and general
positions on the measuring scale and dispersions that we wish to compare, 
the marked difference in size may make such comparisons very unsatisfactory. 
A common solution to this difficulty is to reduce both distributions to percentage
frequencies instead of plotting the original frequencies. It is then as if we
had two distributions, each of whose N’s equal 100. This makes their two 
areas approximately equal in the polygon form, and comparisons of shape, 
level, and dispersion are then quite satisfactory.

How to Find Percentage Frequencies. As an example of how to transform
frequencies into percentages the data in Table 3.4 are presented. In each 
case, the frequencies in the distribution are each multiplied by 100, then 
divided by N. A shorter procedure would be to find the quotient 100/N to 
four or more decimal places, then multiply each frequency in turn by this 
ratio. In distribution I, the ratio is 100/51, which equals 1.9608, and in dis- 
tribution II it is 100/160, which equals 0.6250. Multiplying each frequency 
f1 by 1.9608, we obtain the list of percentages in column 4, and multiplying
each frequency f2 by 0.625, we obtain the list in column 5. Plotting these
percentages above the corresponding midpoints of class intervals, we obtain 
the distribution curves in Fig. 3.5. Although it was apparent in Table 3.4 
that the second group were higher on the scale than the first and that there 


TABLE 3.4. FREQUENCY DISTRIBUTIONS OF SCORES IN A COLLEGE-APTITUDE TEST
FOR FRESHMEN AT TWO DIFFERENT COLLEGES

(1)      (2)  (3)  (4)  (5)
Scores   f1   f2   P1   P2
140-149 8 5.0 
130-139 32 20.0 
120-129 48 30.0 
110-119 18.1 

11.2 


was still considerable overlapping of scores between the two, these facts are 
more clearly brought out in graphic form. Also much clearer is the somewhat
narrower dispersion in the second group as compared with the first. 
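
The shorter procedure for percentage frequencies, finding 100/N once and multiplying each frequency by it, can be sketched as follows. The function name is our own; the three frequencies used are those of the second distribution that survive in Table 3.4, with N = 160 as stated in the text.

```python
def percentage_frequencies(frequencies, n):
    """Convert frequencies to percentages of N by one multiplier, 100/N."""
    ratio = 100 / n
    return [f * ratio for f in frequencies]


# The multipliers quoted in the text, to four decimal places.
print(round(100 / 51, 4))
print(round(100 / 160, 4))

# Frequencies for the intervals 140-149, 130-139, and 120-129
# in the second distribution (N = 160).
print(percentage_frequencies([8, 32, 48], 160))
```

Note that N must be the total number of cases in the whole distribution, not the sum of whatever frequencies happen to be passed in.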
Skewed distributions. In addition, the fact is more clear that the first 
group bunches at the left in its own range and has relatively few high scores, 
whereas the second group bunches at the upper end of its range, with rela- 
tively few low scores. We describe the first distribution as being positively 
skewed (pointed end toward the right, or positive direction) and the second
distribution as being negatively skewed (pointed end toward the left, or negative
direction). The greater irregularity of contour in the first distribution is
probably due to the small number of cases originally in this group. The
changing of the two distributions to the percentage basis has not changed the
contour, only the general vertical size of the curves. 

Comparison of Two Histograms. The same two distributions as illustrated 
in Fig. 3.5 may also be shown in the form of histograms. When overlapping 
histograms become rather involved and confusing, writers sometimes resort 


FIG. 3.5. Distributions of scores in an aptitude test in two colleges. Frequencies have been
reduced to a percentage basis.

FIG. 3.6. Same distributions as represented in Fig. 3.5 shown in the form of two histograms.


to the device shown in Fig. 3.6. In that illustration, a mirror reflection is
pictured for one of the distributions, but both are drawn on the same horizontal
scale. The frequency scale (in terms of percentages here) is repeated,
also in mirror reflection. The shading of the rectangles is optional, but it has
the virtue of making the entire surface within each histogram stand out from
the page.

Other Variations in Presenting Overlapping Curves. The distributions
in Fig. 3.5 are clearly represented as shown in two overlapping polygons. 
There are certain instances in which such line drawings will not suffice. One 
of these is when the two distributions are so extensively overlapping that 
there is considerable crisscrossing of lines and only confusion would result 
unless something is done about it. Figure 3.7 demonstrates such a situation 
and also how the matter is handled, namely, by showing the one polygon in a 
dotted line. By inspection one can readily see to which group all parts of a
polygon belong. The groups are identified, each with its type of line, by 


FIG. 3.7. Two overlapping frequency polygons representing distributions of years of
schooling completed by samples of aviation students in the AAF. (Legend: examined in
Sept. 1942, N = 5027; examined in Sept. 1943, N = 3348. Horizontal axis: last year of
schooling completed, high school through college. Vertical axis: percentage frequency.)


giving the code, in this instance, in the upper right part of the chart. Figure 
3.7 also includes desirable information such as is lacking in Fig. 3.5, namely, 
the total number of individuals in each sample. 

Figure 3.8 gives another demonstration of overlapping distributions that 
call for several different kinds of lines. This is generally desirable when 
there are more than two polygons on the same chart and when there is any 
overlapping at all. 

Figures 3.7 and 3.8, particularly, demonstrate how much meaning one can 
extract from pictorial representations of frequency distributions. Questions 
of policy governing the selection and training of aviation students during 
World War II hinged upon questions of age and of formal education of 
recruits, and it was important to maintain a clear picture of changing status 
of the trainees in these respects. From Fig. 3.7, for example, one would con- 
clude that the typical recruit was a high-school graduate and that men of this 
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category comprised more than half of all recruits. It might have been sur- 
prising to some of the commanding officers to find that there were recruits 
with as little formal schooling as eight years who could pass the Army Air 
Force qualifying examination. Those with less than 12 years of school were 
in very small percentages, however, and either this type of man did not apply 
in large numbers for aircrew training or he was screened out quite generally 
by the qualifying examination. The fact that the two curves, for samples a 
year apart, are almost identical throughout indicates that the same kind of

FIG. 3.8. Three overlapping frequency polygons representing distributions of chronological
ages of aviation students in the AAF. 


men, so far as previous education was concerned, were applying and qualifying
for admission to AAF flying training.

The distributions of aircrew recruits as to chronological age (Fig. 3.8) tell 
quite a different story. Within the same period of a year, although the same
range of ages prevailed (it was limited by regulations), there was a drastic 
trend toward reduction of age. This is shown by the fact that the mode (age 
having the greatest frequency) was at 23 years in the September, 1942, sam- 
ple, at 21 years in the March, 1943, sample, and at 19 in the September, 1943, 
sample. The skewing was slightly negative in the earliest sample and 
markedly positive in the latest sample. In one of the samples there was a 
secondary mode at 27 years. This reflects the known fact that many 27-year- 
old men expedited their entrance into AAF flight training in order to ensure 
acceptance before reaching the age limit. 

Smoothing a Frequency-distribution Curve. Any set of measurements 
like those in Fig. 3.5 is usually regarded as one sample out of a larger population
having practically the same properties as the ones obtained in the sample.
The first group is one of freshmen entering a certain college in a given
year. If it is assumed that over a run of years the kind of students seeking 
entrance and the kind accepted remain about the same, the 51 students whose 
scores are given here may be said to represent the larger population. Had we 
obtained similar scores for this larger population, the irregularities seen in 
Fig. 3.5 would no doubt have been minimized. 

We frequently wish to forecast, from the supposedly representative sample 
that we have, how a larger population would distribute itself. To do this, 
we smooth the frequency distribution in the following manner. We predict 
from the frequencies we have what the corresponding frequencies would be 
in the larger population by a system of running averages. In this process, we
permit the two frequencies on either side, i.e., in the immediately neighboring
intervals, to help determine the expected frequency in any class. In
Table 3.5, the obtained frequencies fo are given in column 2, and it will be

TABLE 3.5. ORIGINAL AND SMOOTHED FREQUENCIES FOR A DISTRIBUTION OF
SCORES IN A SCHOLASTIC-APTITUDE TEST

(1)        (2)   (3)
Scores     fo    fe

120-129    0     0.25
110-119    1     0.50
100-109    0     1.00
90-99      3     2.75
80-89      5     4.75


noticed that two class intervals have been added at the ends of the range of 
scores. 

Running Averages of Frequencies. As a first illustration of the running-average
method, let us apply it to finding the expected frequency fe in the
interval 70-79. The obtained frequency here is 6. We average this along
with the two immediately neighboring frequencies, 5 and 14. But we allow
the middle frequency to carry twice as much weight, and so we add it twice:
5 + 6 + 6 + 14 = 31. We have added four numbers, and so we divide by 4,
obtaining 31/4 = 7.75. This is our predicted frequency for the interval 70-79.
Doing the same for the interval 40-49, we have 7 + 11 + 11 + 4 = 33.
Divided by 4, this becomes 8.25. For the interval 30-39, we have 11 + 4 +
4 + 0, divided by 4, which gives us 4.75. If we wish to do so, we may even
estimate frequencies in the end classes given, for example, in the interval
20-29. Here we have 4 + 0 + 0 + 0 = 4, and divided by 4 the outcome is
1.00. All the expected frequencies for this distribution are given in column 3
of Table 3.5. Their sum is equal to 51, which is a rough check upon the 
accuracy of computation. 
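
The running-average method lends itself to a short sketch. The function name is our own. The obtained frequencies are those quoted in the worked arithmetic above and in Table 3.5, listed from the interval 120-129 down to 20-29; frequencies beyond the two added end classes are assumed to be zero.

```python
def smooth(frequencies):
    """Running averages: each class counts twice and each neighbor once,
    and the sum is divided by 4. Classes beyond either end count as zero."""
    padded = [0] + list(frequencies) + [0]
    return [
        (padded[i - 1] + 2 * padded[i] + padded[i + 1]) / 4
        for i in range(1, len(padded) - 1)
    ]


# Obtained frequencies, from 120-129 at the top down to 20-29.
obtained = [0, 1, 0, 3, 5, 6, 14, 7, 11, 4, 0]
expected = smooth(obtained)

for fo, fe in zip(obtained, expected):
    print(fo, fe)

# The rough check mentioned above: the smoothed frequencies sum to N.
print(sum(obtained), sum(expected))
```

The sum is preserved only because an empty class has been added at each end; without those two classes some of the weight given to the neighbors would fall outside the table and be lost.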

Plotting a Smoothed Distribution. The final step is to plot the smoothed 
curve, which we have in Fig. 3.9. First the obtained frequencies are plotted

FIG. 3.9. A smoothed distribution curve for the scholastic-aptitude scores in Table 3.5.
The circlets represent obtained (observed) frequencies. Dots represent new (smoothed)
frequencies estimated by the use of running averages.


as circlets in their proper places. It is always well to show these even though 
we do not draw the curve through them as before. The expected frequencies 
are next plotted as points. We can probably see by inspection that the 
smoothing could be improved upon. In drawing the smoothed curve, we do 
not feel compelled necessarily to touch all the dots. Being concerned with 
the general shape freed from probably accidental fluctuations, we take the 
liberty of further smoothing by inspection and by freehand drawing. If there 
were too many irregularities, even in the smoothed points, we could, of course, 
repeat the averaging process, but this is usually not wise, because it tends to 
flatten the entire distribution too much and should be avoided if possible. In 
the present instance, very little further adjustment of frequencies was needed 
in order to produce the smoothed and rounded contour seen in Fig. 3.9. We 
may expect with some confidence that the larger population from which this 
group is drawn will distribute more like the rounded curve than like the 
irregular one we actually obtained.

When Coarse Grouping Is Desirable. It was indicated in an earlier footnote
that there are occasions when the rules given for size and number of
class intervals should be modified. In making a graphic representation of 
data it is often desirable to reduce the number of class intervals, even below 
10, and to make the intervals correspondingly larger. Doing so will often 
provide a much better picture. 

In small samples (for this particular purpose we may define a small sample 
as one with an N less than 100), with fine grouping, the frequencies are likely 
to be irregular. Sometimes the effect upon the graphic figure is to produce a 
"saw-tooth" contour. It is very probable that the population distribution, 
if we had it, would be smooth and regular. Since we usually want the sample 
distribution to reflect the general picture of the population from which it came 
and which it is supposed to represent, we would like to avoid those irregu- 
larities. One solution already offered is that of smoothing the distribution 
curve. There are some who object to smoothing as the remedy, and for them 
there is another possibility. In general, curves will be more regular if group- 
ing is coarser. 

Another aspect to this problem is that the particular frequencies we obtain 
by grouping are strongly dependent upon the choice we make in starting each 
class interval. With the same size of class interval, we might derive quite a 
different-appearing frequency polygon simply by making our division points 
between classes at other places, particularly if the sample is small. One can 
readily demonstrate this by choosing an appropriate interval of 3, let us say, 
and by setting up three distributions, starting the lowest interval at 12, 13, 
and 14, respectively, when the lowest score is 14. By introducing coarser 
grouping, this phenomenon, too, tends to be counteracted. 
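
The demonstration suggested above can be sketched in a few lines. The scores below are hypothetical (chosen only so that the lowest score is 14), and the helper name `group_counts` is ours; the point is simply that the same data and the same interval of 3 give different-looking frequencies when the lowest interval starts at 12, 13, or 14.

```python
from collections import Counter


def group_counts(scores, start, width):
    """Return (low, high, frequency) per class interval, lowest interval first."""
    counts = Counter((s - start) // width for s in scores)
    top = max(counts)
    return [
        (start + k * width, start + (k + 1) * width - 1, counts.get(k, 0))
        for k in range(top + 1)
    ]


# Hypothetical small sample of whole-number scores; the lowest score is 14.
scores = [14, 15, 15, 17, 18, 18, 19, 20, 20, 20, 21, 23, 24, 24, 26]

for start in (12, 13, 14):
    frequencies = [f for _, _, f in group_counts(scores, start, 3)]
    print("lowest interval starts at", start, "->", frequencies)
```

With these particular scores, one starting point produces a single sharp peak while another spreads the peak over two neighboring classes.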

Another consideration in this grouping problem is the position of the mode, 
i.e., the point on the measurement scale corresponding to the highest point 
on the frequency curve. As different sizes of interval are utilized, and as 
different starting points for intervals are chosen, so the mode may shift up or 
down on the measurement scale, even jumping from interval to interval. 
Coarser grouping will also tend to stabilize the interval and the value of the 
mode. 

Based upon certain mathematical considerations which we cannot go into 
here, Kelley has proposed that the number of classes to be utilized in the 
graphic representation of a distribution should be determined roughly from 
the size of sample, as shown in Table 3.6. 

From the information given in Table 3.6, one would be justified in using 
only eight classes for the ink-blot-test data, which have been used so exten- 
sively for illustrations in this chapter. This number of classes would mean a 
class interval of 6, which could, of course, be used, though it is not in the preferred 
list. An interval of 10, which is in the preferred list, would result in 
only five classes, which would be fewer than are called for in Table 3.6. Remember 
that the coarser grouping is called for, thus far, only for the purpose of 
graphic representation. The requirement of 10 or more classes still holds for 
computations such as we meet in the chapter to follow. Since one is often 
faced with the need of both graphic and computational use of data, some kind 
of compromise is practically desirable and defensible in many instances. The 
illustrative example is probably such an instance. The 10 classes used for 
the ink-blot data yield a frequency polygon which is rather regular, with one 
notable inversion, and the same 10-class distribution will serve for the computations 
required. The reader will be reminded later (see page 95), however, 
that with fewer than 12 classes it is necessary to make certain corrections for 
"grouping errors" when certain accurate computations are desired. 


TABLE 3.6. THE NUMBER OF CLASSES TO USE IN PREPARING FREQUENCY 
DISTRIBUTIONS FOR GRAPHIC REPRESENTATION FOR DIFFERENT SIZES OF SAMPLE* 

Sample Size (N)    Number of Classes 
     4-5                  2 
     6-8                  3 
     9-14                 4 
    15-21                 5 
    22-32                 6 
    33-46                 7 
    47-64                 8 
    65-89                 9 
    90-117               10 
   118-153               11 
   154-192               12 
   193-255               13 
   256-315               14 

* From Kelley, T. L. Fundamentals of Statistics. Cambridge, Mass.: Harvard University Press, 
1947. P. 133. Reproduced by permission. 
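
As a convenience, Table 3.6 can be turned into a simple lookup. This is only a sketch: the function name is ours, and the first two rows are read as 4-5 and 6-8, which is our reading of the printed table.

```python
# (lowest N, highest N, number of classes), transcribed from Table 3.6.
KELLEY_CLASSES = [
    (4, 5, 2), (6, 8, 3), (9, 14, 4), (15, 21, 5), (22, 32, 6),
    (33, 46, 7), (47, 64, 8), (65, 89, 9), (90, 117, 10),
    (118, 153, 11), (154, 192, 12), (193, 255, 13), (256, 315, 14),
]


def classes_for_graphing(n):
    """Number of classes suggested by Table 3.6 for a sample of size n."""
    for low, high, k in KELLEY_CLASSES:
        if low <= n <= high:
            return k
    raise ValueError("sample size outside the range covered by Table 3.6")


# The ink-blot data have N = 50, for which the table calls for eight classes.
print(classes_for_graphing(50))
```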


Exercises 


1. For each one of the following ranges of measurements, state your judgment of (1) the 
best size of class interval, (2) the score limits of the lowest class interval, (3) the exact limits 
of the same interval, and (4) its midpoint. 

a. 83 to 197.          b. 4 to 39. 
c. 17 to 32.           d. 35 to 96. 
e. 0 to 188.           f. -24 to +28. 
g. 0.141 to 0.205. 


2. Given the following list of scores in a “nervousness” test (Data 3A) and using a class 
interval of 5, set up a frequency distribution. In the first solution, begin the lowest class 
interval with a score of 35. List all exact limits of class intervals and also exact midpoints. 
In a second solution, start the lowest class interval with a score of 33. After finishing both 
solutions, write out a comparison of the two distributions and defend the choice of the one 
as against the other. 

DATA 3A. SCORES IN A NERVOUSNESS INVENTORY 


3. Given the following list of scores, each of which is the percentage of 400 words judged 
pleasant by an individual (Data 3B), set up a frequency distribution making the wisest 
choice of class interval and class limits. 


DATA 3B. AFFECTIVITY RATIOS 
(All have been rounded to the nearest whole number) 


4. Plot a frequency polygon and a histogram for Data 3C, group I. State your conclu- 
sions about these data as revealed by your plotted distributions. 


DATA 3C. DISTRIBUTIONS OF CHEMISTRY-APTITUDE SCORES IN TWO FRESHMAN 
CHEMISTRY COURSES, I AND II 

Scores    Frequencies    Frequencies 
          for group I    for group II 
90-94          4              2 
85-89         10              0 
80-84         14              0 
75-79         19              0 
70-74         32              2 
65-69         31              4 

5. Apply the smoothing process described in this chapter to Data 3C, group I. Plot 
a curve based upon the smoothed frequencies but show the original frequencies as points, 
as was done in Fig. 3.9. In what respects has smoothing changed the picture of these 
data? 

6. Reduce distributions I and II (Data 3C) to percentage distributions, and plot them 
on the same diagram. Make a descriptive comparison of the two distributions as drawn. 


Answers 
1. 
          i        Score limits      Exact limits 
a.       10        80-89             79.5-89.5 
b.        3        3-5               2.5-5.5 
c.        1        17                16.5-17.5 
d.        5        35-39             34.5-39.5 
e.       20        0-19              -0.5 to 19.5 
f.        5        -25 to -21        -25.5 to -20.5 
g.    0.005        0.140-0.144       0.1395-0.1445 

2. Frequencies, first solution: 5, 4, 4, 8, 11, 12, 11, 6, 2, 1; second solution: 1, 4, 5, 5, 8, 
13, 13, 8, 5, 1, 1. 
3. Frequencies (i = 3, with lowest interval at 30-32): 1; 1; 2; 4; 8; 9; 9; 16; 8; 3; 2; 1. 
5. Smoothed frequencies: 
I. 1.0; 4.5; 9.5; 13.2; 21.0; 28.5; 33.5; 34.8; 31.2; 26.8; 22.2; 16.8; 11.0; 5.8; 2.8; 1.8. 
II. 0.5; 1.0; 0.5; 0.0; 0.5; 2.0; 3.8; 6.5; 10.5; 14.8; 19.0; 20.5; 19.5; 18.2; 12.2; 4.0; 0.2. 
6. Percentages: 
I. 1.5; 3.8; 5.3; 7.1; 12.0; 11.6; 15.0; 10.5; 10.9; 7.9; 6.8; 3.8; 2.3; 0.4; 1.1. 
II. 1.5; 0.0; 0.0; 0.0; 1.5; 3.0; 3.7; 9.0; 9.7; 15.7; 15.7; 14.2; 14.9; 10.4; 0.8. 

CHAPTER 4 


MEASURES OF CENTRAL VALUE 


This chapter is about averages, of which there are several kinds. Three of 
them—the arithmetic mean (or mean, for short), the median, and the mode— 
will be explained here. Two others, the geometric mean and the harmonic 
mean, being much less useful, will be briefly mentioned. 

An average is a number indicating the central value of a group of observations 
or of individuals. To the question, "How good is a sixth-grade class in 
arithmetic?" the most reliable and meaningful kind of answer would be the 
mean or median in some acceptable test of arithmetical achievement. To 
the question, “What is the weakest tone to which this dog will respond?” 
the best kind of answer is to state the average result from a number of trials. 
In either case a single score or a single measurement of the threshold stimulus 
would be highly unreliable, for not all measurements, even from repeated 
observations of the same thing, have the same value. To answer those ques- 
tions by reciting the long list of individual measurements would be highly 
uneconomical in the reporting and not very enlightening to the questioner. 

The average, whether it be a mean, median, or mode, serves two important 
purposes. First, it is a shorthand description of a mass of quantitative data 
obtained from a sample. It is surely more meaningful and economical to let 
one number stand for a group than to try to note and remember all the 
particular numbers. An average is therefore descriptive of a sample obtained 
at a particular time in a particular way. Second, it also describes indirectly 
but with some accuracy the population from which the sample was drawn. 
If the sample of sixth-grade children is representative of all the sixth-grade 
children in the same school, in the same city, or even in the same county, then 
the average of their scores tells us much about the average that would be 
made by the population that they represent, be it school-wide, city-wide, or 
county-wide. If we examine the dog's hearing under a set of conditions that 
is characteristic of his general, day-to-day existence, the sample average will 
be very close to one that we could actually obtain by testing him day after 
day on many days. 

It is only because sample averages are close estimates of larger population 
averages that we can generalize beyond particular samples at all and make 
predictions beyond the limits of a sample. This means considerable economy 
of effort, but, far more important than that, it makes possible all scientific 
investigation. We rarely or never know the average of a population; conse- 
quently we do not know by how much our obtained average has missed it, 
but if our sampling has been done in the proper manner we can estimate about 
how far we may have missed it, as will be shown in Chap. 9. In the present 
chapter we shall be concerned only with the methods of computing averages 
from sample data. 


THE ARITHMETIC MEAN 


The Mean of Ungrouped Data. Most readers already know that to find 
the arithmetic mean (popularly called the average), we sum the measurements 
and then divide by the number of measurements or cases. In terms of a 
formula 

$$M = \frac{\Sigma X}{N}$$     (The arithmetic mean) (4.1) 

where M = arithmetic mean 
      Σ = "the sum of" 
      X = each of the measurements or scores in turn 
      N = number of measurements or scores 

In a certain experiment to determine the lowest frequency of vibration of a 
sound wave that would yield a tone for a human observer, 10 trials were 
given, with the following results: 13, 17, 15, 11, 13, 11, 17, 13, 11, 11 (cycles 
per second). The sum of these measurements is 132, and therefore the mean 
is 13.2 cycles per second. Note that in reporting a mean it is given in terms 
of the unit of measurement, which is specifically stated. A mean is never an 
abstract number; it is always a mean of something and is always in terms of 
some unit of measurement. 
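
Formula (4.1) applied to the 10 threshold trials above can be checked in a few lines. The variable names are ours; the data are those given in the text.

```python
# The 10 threshold measurements from the text, in cycles per second.
trials = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]

total = sum(trials)          # the sum of X
n = len(trials)              # N, the number of measurements
mean = total / n             # formula (4.1): M = (sum of X) / N

print(total, n, mean)
```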

As another example, the scores on the ink-blot test found in Table 3.1, when 
summed, give ΣX equal to 1,480. The mean, with the use of formula (4.1), is 

$$M = \frac{\Sigma X}{N} = \frac{1,480}{50} = 29.60$$ 

The mean ink-blot score is 29.60 score units. In practice, it is quite custom- 
ary in reporting a mean to round to one more figure at the right than the 
original measurements had—in this case, to keep one decimal place, where the 
original scores were whole numbers. We report the mean as 29.6 score units. 

The Mean of Grouped Data. When data come to us grouped, or when they 
are too lengthy for comfortable addition without the aid of a calculating 
machine, or when we are going to group them for other purposes anyway, we 
find it more convenient to apply another formula for the mean: 

$$M = \frac{\Sigma f X_c}{N}$$     (Arithmetic mean from grouped data) (4.2) 

where the symbols N and Σ have the same meaning as before, X_c = midpoint 
of a class interval, and f = number of cases within the interval. 

1 One could determine the number of accurate, significant figures in a mean by applying 
the rules of Chap. 2 at each step of the operations. 


TABLE 4.1. COMPUTATION OF THE MEAN IN GROUPED DATA 

  (1)        (2)      (3)       (4) 
Scores       X_c       f       fX_c 
55-59        57        1         57 
50-54        52        1         52 
45-49        47        3        141 
40-44        42        4        168 
35-39        37        6        222 
30-34        32        7        224 
25-29        27       12        324 
20-24        22        6        132 
15-19        17        8        136 
10-14        12        2         24 
Sums                  50      1,480 

Mean = ΣfX_c / N = 1,480/50 = 29.60 


The solution by way of this formula is illustrated in Table 4.1. Here we 
have only as many different X values as there are class intervals, instead of 
as many as there are original measurements. Each class interval has as its X 
value the midpoint of that interval, which is given the special symbol X_c. 
This practice assumes that the midpoint of the interval correctly represents 
all the scores within that interval. This will not be exactly true in many 
instances, but the discrepancy is small in any case and, in computing the 
mean, most of the discrepancies tend to counterbalance others, giving a mean 
that is essentially correct.¹ 

In column 2 of Table 4.1, the midpoints of the intervals are given. We 
must add each midpoint into our total as many times as there are cases within 
that interval. This means finding for each interval the product of f times 
X_c, or fX_c. The fX_c products are listed in column 4. The sum of the fX_c 
products (ΣfX_c) is equal to 1,480. Dividing this by N, we find the mean to 
be 29.60, as it was for the same data ungrouped. As was indicated before, 
we should not be surprised to find a minor discrepancy between the means 
calculated from grouped and ungrouped data. It happened here that the 
discrepancy was zero. We may also expect trivial discrepancies in means 
when the same data are grouped differently, i.e., with different size of class 
interval or with different starting points for intervals of the same size. 

1 A discussion of “grouping errors” and their effects upon statistics will be found in the 
next chapter. 
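
The grouped-data computation of Table 4.1 can be sketched as follows. The midpoints and frequencies are those of the table; the variable names are ours.

```python
# (midpoint of class interval, frequency f) for each row of Table 4.1,
# listed from the highest interval (55-59) down to the lowest (10-14).
table_4_1 = [(57, 1), (52, 1), (47, 3), (42, 4), (37, 6),
             (32, 7), (27, 12), (22, 6), (17, 8), (12, 2)]

n = sum(f for _, f in table_4_1)                         # N, the sum of column 3
sum_f_mid = sum(mid * f for mid, f in table_4_1)         # the sum of column 4
mean = sum_f_mid / n                                     # formula (4.2)

print(n, sum_f_mid, mean)
```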

The Mean Computed from Coded Values. When the original measure- 
ments are relatively large numbers, particularly when the midpoints and the 
frequencies are large numbers, the method just described can well give way 
to a short-cut procedure that saves pencil-and-paper work. Even greater 
saving is appreciated when, as in the next chapter, a standard deviation is 
also to be computed. This procedure requires the use of “coded” values to 
replace the midpoint values. 

The steps are illustrated in Table 4.2, including the coding process. In 
this table it can be seen that many of the actual midpoints would be four- 
place numbers; for example, the highest interval has a midpoint of 154.5 
(midway between 149.5 and 159.5). Consequently, the fX_c products would 
also be rather large. The coded values for the intervals, given in column 3, 
are called x’. They will now be explained. 

The Coding Process. First, we select a new origin. The new origin is that 
particular X_c value that we choose to call zero. In order to obtain the greatest 
benefit from the coding method, it is well to choose the origin near the 
center of the distribution. If there is an odd number of class intervals, the 
midpoint of the middle one is a good candidate for the origin. If there is an 
even number of class intervals, either of the two middle ones would do. 

There are other considerations, however. When the distribution is rather 
skewed, as in the case of the data in Table 4.2, the middle of the data is not 
likely to be in the middle interval or intervals. Another solution is to select 
the midpoint of the interval containing the median (see Table 4.3 for the 
method of finding a median). The median is in the interval 80-89. This is 
farther from the center of the range than we would ordinarily go to place the 
origin. A good compromise, then, seems to be the interval 90-99, with its 
midpoint of 94.5. 

We could now find new midpoint values by subtracting 94.5 from the midpoint 
values X_c of all intervals represented in Table 4.2. These would range 
from —40.0 for the lowest interval to +60.0 for the highest interval, with a 
midpoint of 0.0 for the interval 90-99. The extreme values are still some- 
what large; consequently, we proceed to make them smaller by dividing them 
all by 10, the size of the class interval. The result gives the x’ values of 
column 3. We now have simple integers. Some of them are negative, which 
complicates things a bit, but this is the only price we pay for obtaining small 
code values with which to work. 

Let us next proceed to find the mean of the coded values. The steps are 
much the same as those taken in Table 4.1. One difference is that some of 
the midpoint values (x’) are negative, and great care must be maintained to 
take this into account. The sum of the positive fx’ products is +56, and the 
sum of the negative fx' products is -68. The algebraic sum of all the fx' 
products is 56 - 68, which is -12. The Σfx' is therefore -12. The mean 
of the x' values is given by a formula like (4.2): 

$$M_{x'} = \frac{\Sigma f x'}{N}$$     (Mean of coded values) (4.3) 

For the data of Table 4.2, M_x' = -0.188. 


TABLE 4.2. COMPUTATION OF THE MEAN IN GROUPED DATA BY USING THE 
CODE METHOD 

  (1)        (2)      (3)      (4) 
Scores        f        x'      fx' 
150-159       2       +6      +12 
140-149       2       +5      +10 
130-139       4       +4      +16 
120-129       1       +3       +3 
110-119       5       +2      +10 
100-109       5       +1       +5 
                              +56 
 90-99       12        0        0 
 80-89       10       -1      -10 
 70-79       12       -2      -24 
 60-69       10       -3      -30 
 50-59        1       -4       -4 
                              -68 
Sums         64               -12 

M_x' = Σfx'/N = -12/64 = -0.188 

M_x = 10(-0.188) + 94.5 = 92.62 


Uncoding the Mean. To obtain from this value the mean of the original 
measurements we must go through the process of “uncoding.” The coding 
process involved two steps: subtracting 94.5, then dividing by 10. We can 
describe this in general terms by the equation 

$$x' = \frac{X_c - X_0}{i}$$     (4.4) 

where X_0 is the midpoint value chosen for the origin of the coded values and 
other symbols are as defined before. The uncoding proceeds in reverse. The 
two steps include multiplying by i, then adding X_0. In terms of an equation, 

$$M_x = iM_{x'} + X_0$$     (Mean of measurements, from mean of coded values) (4.5) 

Substituting the necessary values in formula (4.5), 

$$M_x = 10(-0.188) + 94.5 = 92.62$$ 

A Summary of the Code Solution of the Mean. The steps involved in the 
code method of computing the mean may be summarized as follows: 


Step 1. Set up the frequency distribution. 

Step 2. Choose a temporary origin, X_0. This is the midpoint of the interval 
(1) near the center of the range, or (2) containing the median, or (3) 
a compromise between the two. 

Step 3. Assign to the class intervals new small, integral values, starting with 
zero at the origin, with positive values above it and negative ones 
below. Call these new values x’. 

Step 4. Find the fx’ product for each interval, and record all such values in a 
column. 

Step 5. Sum the fx' products algebraically. This is Σfx'. 

Step 6. Divide the sum of fx' products by N, giving M_x', the mean of the 
coded values. 

Step 7. Multiply this quotient by i, the size of class interval. 

Step 8. Add this algebraically to X_0, which gives the mean M_x. 

A single formula representing the last three steps is 

$$M_x = X_0 + i\left(\frac{\Sigma f x'}{N}\right)$$     (Arithmetic mean from grouped and coded data) (4.6) 
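
The steps of the code method can be sketched for the data of Table 4.2. The function and argument names are ours. Note that the sketch carries the coded mean unrounded (-0.1875), so it returns 92.625, where the text, working with the rounded value -0.188, reports 92.62.

```python
def coded_mean(freqs_low_to_high, origin_index, origin_midpoint, width):
    """Mean from grouped data by the code method.

    freqs_low_to_high: frequencies listed from the lowest interval upward
    origin_index: position (0-based) of the interval chosen as origin
    origin_midpoint: midpoint of the origin interval
    width: size of class interval, i
    """
    n = sum(freqs_low_to_high)
    # Coded value x' is the number of intervals above (+) or below (-) the origin.
    sum_fx = sum(f * (k - origin_index) for k, f in enumerate(freqs_low_to_high))
    mean_coded = sum_fx / n                       # mean of the coded values
    mean = origin_midpoint + width * mean_coded   # uncoding
    return sum_fx, mean_coded, mean


# Table 4.2, from interval 50-59 up to 150-159; origin at 90-99 (midpoint 94.5).
freqs = [1, 10, 12, 10, 12, 5, 5, 1, 4, 2, 2]
sum_fx, mean_coded, mean = coded_mean(freqs, origin_index=4,
                                      origin_midpoint=94.5, width=10)
print(sum_fx, mean_coded, mean)
```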


THE MEDIAN 


The median is defined as that point on the scale of measurement above 
which are exactly half the cases and below which are the other half. Note 
that it is defined as a point and not as a score or any particular measurement. 
If this conception is kept clearly in mind, many difficulties will be forestalled. 


TABLE 4.3. COMPUTATION OF THE MEDIAN SIZE OF CLASS IN A CERTAIN SCHOOL, 
WITH THE USE OF GROUPED DATA 

Class size     f 
  40-44        1 
  35-39        0 
  30-34        3 
  25-29        5 
  20-24        3     12 = number of cases above the interval containing the median 
  15-19       10 
  10-14        1 
   5-9         1      6 = number of cases below the interval containing the median 
   0-4         4 
             N = 28 

Mdn = 14.5 + (8/10) × 5 = 14.5 + 4.0 = 18.5 
Mdn = 19.5 - (2/10) × 5 = 19.5 - 1.0 = 18.5 




The Median from Grouped Data. It is probably easier to grasp the process 
of computing a median in grouped data. For a first illustration, consider 
Table 4.3. Here there are 28 cases, and so the median is that number of 
points on the measuring scale above which there are 14 cases and below which 
there are 14. Counting frequencies from the bottom upward, we find that 
4 + 1 + 1 + 10 = 16 cases, or 2 more than we want. To make 14 cases, 
we need 8 out of the 10. The median lies somewhere within the interval 
15-19, whose exact limits are 14.5 and 19.5. We assume for the sake of 
computation that the 10 cases within this interval are evenly spread over the 
distance from 14.5 to 19.5 (see Fig. 4.1). We must interpolate within this 
range to find how far above 14.5 we need to go in order to include the eight 
cases we need below the median. We must go 8/10 of the way, for 8 is the 
number we require, and 10 is the total number in the interval. The total 
distance is 5 units, and so on the scale of measurement we go 8/10 of 5, or 
exactly 4.0 units. Adding this 4.0 to the lower limit of the class interval 14.5, 
we get 14.5 + 4.0 = 18.5 as the median. 

We can check this by counting down from the top of the distribution until we 
include N/2 of the cases, 14 in this problem. Starting at the top, we find 
that 

1 + 0 + 3 + 5 + 3 = 12 

We need two more cases out of the next group of 10. We must go 2/10 of the 
way below the upper limit of the interval, i.e., below 19.5. This means 2/10 
of 5, or exactly 1.0 unit. The upper limit, 19.5 minus 1.0, gives us 18.5 for 
the median, which checks with the one obtained by counting up from below. 
It is well always to check the determination of a median in this manner, and 
to do so involves very little work. If the two estimates do not agree exactly, 
something is wrong. 

FIG. 4.1. Showing how the 10 cases in the interval 14.5 to 19.5 are distributed. 
Each case is assumed to occupy a tenth of the interval, or one-half of a score 
unit. The eighth one extends up to the point 18.5, which is the median. 

To take another example with grouped data, consider Table 4.4, where N 
is an odd number. Here N/2 is 18.5, but the principle of interpolating within 
an interval for the exact median is just the same. Counting up from below, 
we find that 1 + 5 + 8 = 14, which lacks 4.5 cases of including the lower 
half. In the next interval, we must go 4.5/8 of the way, or 4.5/8 times 2, 
which equals 9/8, or 1.125. Adding this many units to the lower limit of the 
interval (22.5), we have 23.625 as the median; or dropping all but one decimal 
place, we report the median as 23.6 score units. Checking by counting down 
from the top, we find 15 cases above the point 24.5. Going 3.5/8 of the way 
down into the interval of 2 units, we find that we must deduct 0.875 from 24.5 
to find the median. When rounded to one decimal place, the median is 23.6, 
as before. 

TABLE 4.4. COMPUTATION OF THE MEDIAN SCORE IN A SENTENCE-CONSTRUCTION 
TEST AS GIVEN TO 37 MEN 

Scores      f 
37-38       1 
35-36       2 
33-34       0 
31-32       1 
29-30       0 
27-28       6     15 = number of cases above interval containing the median 
25-26       5 
23-24       8 
21-22       8     14 = number of cases below interval containing the median 
19-20       5 
17-18       1 
          N = 37 

Mdn = 22.5 + (4.5/8) × 2 = 22.5 + 9/8 = 22.5 + 1.125 = 23.6 
Mdn = 24.5 - (3.5/8) × 2 = 24.5 - 7/8 = 24.5 - 0.875 = 23.6 

In terms of a formula, the interpolated median is found from 
below by 

$$\mathrm{Mdn} = l + \left(\frac{N/2 - F_b}{f_p}\right) i$$     (Interpolation of a median from below) (4.7a) 

where l = exact lower limit of class interval containing the median, F_b = sum 
of all frequencies below l, f_p = frequency of the interval containing Mdn, and 
N and i are defined as usual. 

In terms of a similar formula, the median is found from above by 

$$\mathrm{Mdn} = u - \left(\frac{N/2 - F_a}{f_p}\right) i$$     (Interpolation of a median from above) (4.7b) 

where u = exact upper limit of the interval containing the median and 
F_a = sum of all frequencies above u. Other symbols are as defined previously. 

A Summary of the Steps for Interpolating a Median. The steps for computing 
a median from grouped data may be summarized as follows: 


Step 1. Find N/2, or half the number of cases in the distribution. 

Step 2. Count up from below until the interval containing the median is 
located. 

Step 3. Determine how many cases are needed out of this interval to make 
N/2 cases. 

Step 4. Divide this number needed by the number of cases within the interval. 

Step 5. Multiply this by the size of class interval. 

Step 6. Add this to the exact lower limit of the interval containing the 
median. 

Step 7. Check by adding down from the top to find to what point the upper 
half of the cases extend in a manner analogous to that described in 
steps 2 to 5 inclusive. 

Step 8. Deduct the number of score units found in step 7 from the exact upper 
limit of the interval containing the median. 
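
The eight steps can be sketched as two small functions, one interpolating from below and one checking from above, applied here to the data of Tables 4.3 and 4.4. The function names and the way the intervals are passed in (exact lower limits, lowest interval first) are our own choices, and the sketch covers only the ordinary case in which the median falls inside an interval that contains cases.

```python
def median_from_below(lower_limits, freqs, width):
    """Steps 1 to 6: count up from the lowest interval and interpolate."""
    half = sum(freqs) / 2
    below = 0
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= half:
            return lower + (half - below) / f * width
        below += f


def median_from_above(lower_limits, freqs, width):
    """Steps 7 and 8: count down from the highest interval as a check."""
    half = sum(freqs) / 2
    above = 0
    for lower, f in zip(reversed(lower_limits), reversed(freqs)):
        if f > 0 and above + f >= half:
            return (lower + width) - (half - above) / f * width
        above += f


# Table 4.3: exact lower limits -0.5, 4.5, ..., 39.5; interval size 5.
limits_43 = [-0.5 + 5 * k for k in range(9)]
freqs_43 = [4, 1, 1, 10, 3, 5, 3, 0, 1]

# Table 4.4: exact lower limits 16.5, 18.5, ..., 36.5; interval size 2.
limits_44 = [16.5 + 2 * k for k in range(11)]
freqs_44 = [1, 5, 8, 8, 5, 6, 0, 1, 0, 2, 1]

print(median_from_below(limits_43, freqs_43, 5), median_from_above(limits_43, freqs_43, 5))
print(median_from_below(limits_44, freqs_44, 2), median_from_above(limits_44, freqs_44, 2))
```

If the two printed values for a table do not agree, something is wrong with the data entry or with the interval limits.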


Some Special Situations. There are some instances in which things do not 
turn out just as they did in the two illustrative examples. 

When the Median Falls between Intervals. If it should happen, in adding up 
cases from below, that half the cases take in all the cases in the last interval, 
the median is then the exact upper limit of that interval. In counting down 
from above, it would be found that all the cases in the interval just above this 
one would also be required to make N/2, and so its exact bottom limit would 
be the median. This coincides with the exact upper limit of the interval 
below; thus, the median checks. As an example, note the following fictitious 


data: 
20-24 | 25-29 | 30-34 | 35-39 | 40-44 | 45-49 | 50-54 | 55-59 
2 7 10 


Here N/2 is 34. This many cases takes us exactly through the interval 
35-39. The median is 39.5. From above down, we are carried through the 
interval 40-44, whose lower limit is 39.5. Again the median is 39.5. 

When There Are No Cases within the Interval Containing the Median. 
Another question arises when the median falls within an interval where there 
are no cases. It is even possible that, in the region of the median, two or 
more intervals have frequencies of zero. If the range having no cases is one 
interval, the median may be taken as the midpoint of that interval, but this 
gives a very crude estimate unless the size of the interval is small—for exam- 
ple, not over three units. If that range covers two or more intervals, no good 
estimate can be made for the median. 


Scores     8-10 | 11-13 | 14-16 | 17-19 | 20-22 | 23-25 
f 


In the data just preceding, the median is 15.0, which is midway between 13.5 
(to which point the lower half of the cases extend) and 16.5 (to which point 
the upper half of the cases extend). Or it is the arithmetic mean of those 
two limits, for 16.5 + 13.5 divided by 2 is 15.0. 

The Median from Ungrouped Data. Things learned in finding a median in 
grouped distributions should carry over almost intact to the use of ungrouped 
data. The median is a point on the measuring scale. In ungrouped data, 
each score or measurement is assumed to occupy a range of one unit. The 
median either falls within one of those units or somewhere between units. 
The first step is to arrange the measurements in order of their size. The list 
of 10 measurements of the threshold for pitch as given on p. 54, when placed 
in rank order, becomes 


11, 11, 11, 11, 13, 13, 13, 15, 17, 17 


As in the case of grouped data, it is assumed that the four 11's occupy the 
range from 10.5 to 11.5; the three 13's occupy the range from 12.5 to 13.5, etc. 
Counting from below to include five cases brings us to the first 13 that must 
be included among the five. We must therefore extend one-third of the way 
in the interval of 1 unit, or 0.33 unit into the interval, starting at 12.5. The 
median is 12.5 + 0.33, which equals 12.83, or, when rounded, 12.8. In 
checking from above, the median is found at 13.5 — 0.7, which also equals 
12.8. 
In the series of measurements 


2, 5, 7, 8, 9, 10, 17 


the median comes midway in the fourth one, which is 8. Since 8 occupies a 
range of 7.5 to 8.5, the median is the midpoint of this range, or exactly 8.0. 
In the series of measurements 


7, 9, 10, 12, 13, 15, 18, 20 


four are 13 or above, and four are 12 or below. The division between upper 
and lower halves comes at 12.5, which is the median in this case. In the 
array of scores 

15, 17, 18, 20, 23, 24, 27, 30 


the lower half extends up to 20.5, and the upper half extends down to 22.5. 
Midway between these two values is the point 21.5, or the average of the two. 
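
The four ungrouped examples above follow one rule, which can be sketched as below. The function name is ours, and the sketch assumes whole-number scores, each occupying a range of one unit (from 0.5 below to 0.5 above the score).

```python
from collections import Counter


def median_point(scores):
    """Median as a point on the scale, for whole-number scores."""
    counts = sorted(Counter(scores).items())   # (score, frequency), lowest first
    half = len(scores) / 2
    below = 0
    for idx, (value, f) in enumerate(counts):
        if below + f > half:
            # The median falls inside this score's unit range: interpolate.
            return (value - 0.5) + (half - below) / f
        if below + f == half:
            # The lower half ends exactly here: take the point midway between
            # this score's upper limit and the next score's lower limit.
            next_value = counts[idx + 1][0]
            return ((value + 0.5) + (next_value - 0.5)) / 2
        below += f


print(round(median_point([13, 17, 15, 11, 13, 11, 17, 13, 11, 11]), 2))
print(median_point([2, 5, 7, 8, 9, 10, 17]))
print(median_point([7, 9, 10, 12, 13, 15, 18, 20]))
print(median_point([15, 17, 18, 20, 23, 24, 27, 30]))
```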

It is probably obvious that the median of so small a number of observations 
cannot be very reliable, and we should not place too much reliance upon it or 
carry our calculations to more than one decimal place (we might even report 
nearest whole numbers); but in order to keep consistent certain principles of 
the median and of the process of computing it, certain steps have been 
emphasized. Whenever there is doubt concerning special cases not covered 
in these illustrations, an application of these principles should take care of 
the matter. 

THE MODE 


The mode is strictly defined as the point on the scale of measurement with 
maximum frequency in a distribution. When we have ungrouped data, the 
mode is that measurement which occurs most frequently. Usually it is some- 
where near the center of the distribution, and in a strictly normal (Gaussian) 
distribution it coincides with the mean and the median. 

The Crude Mode. In a distribution of grouped data, the crude mode is the 
midpoint of that class interval having the greatest frequency. In Table 4.1, the 
highest frequency is 12, for the interval 25-29. The midpoint of this interval 
is 27, and so the mode is taken to be 27.0. In Table 4.2, there are two inter- 
vals with the same maximum frequency of 12. If these two intervals had 
been separated by more than one intervening interval of lower frequency, we 
should be justified in saying that the distribution is bimodal (having two 
modes). But the single intervening frequency of 10 hardly gives us sufficient 
basis for this conclusion. The distribution is therefore probably really uni- 
modal, but we are not able to decide upon its crude mode. A calculated 
mode can be found, as we shall soon see. 

In Table 4.3, the crude mode is clearly 17.0. In Table 4.4, the maximum 
frequency is shared by two neighboring intervals. In a situation like this, 
we do the reasonable thing of assigning the crude mode to the dividing point 
between these intervals, which is 22.5. Unless the data are reasonably 
numerous, so that there is clearly an interval of highest frequency, we should 
not attempt to assign a modal value to the distribution. For example, the 
10 measurements of threshold for pitch present an unusual situation, with the 
greatest frequency (four cases) at 11, which is at one end of the distribution. 
Following right behind is the measurement of 13, with three cases. Here it 
would be rather meaningless to say that the mode is 11. 
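
The rules for the crude mode given above can be sketched in a few lines of Python. This is only an illustration: the function name `crude_mode` and the frequencies are hypothetical, except that the interval 25-29 with a frequency of 12 echoes the case cited from Table 4.1. The sketch assumes integer score limits, so that the dividing point between two neighboring intervals lies halfway between the upper limit of one and the lower limit of the next.

```python
def crude_mode(intervals, freqs):
    """Return the crude mode of a grouped distribution, or None if undecided.

    intervals: list of (lower, upper) score limits, in ascending order
    freqs:     frequency in each interval
    """
    top = max(freqs)
    winners = [i for i, f in enumerate(freqs) if f == top]
    if len(winners) == 1:
        lower, upper = intervals[winners[0]]
        return (lower + upper) / 2  # midpoint of the modal interval
    if len(winners) == 2 and winners[1] - winners[0] == 1:
        # Two neighboring intervals share the maximum frequency:
        # take the dividing point between them.
        return (intervals[winners[0]][1] + intervals[winners[1]][0]) / 2
    return None  # no clear crude mode can be assigned


# Hypothetical frequencies for illustration only.
intervals = [(15, 19), (20, 24), (25, 29), (30, 34), (35, 39)]
print(crude_mode(intervals, [3, 8, 12, 9, 4]))   # 27.0
print(crude_mode(intervals, [3, 12, 12, 9, 4]))  # 24.5
print(crude_mode(intervals, [12, 8, 12, 9, 4]))  # None
```

Returning `None` in the third case mirrors the advice in the text: when the maximum is shared by separated intervals, no crude mode should be assigned by inspection.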

Estimation of the Mode by Coarse Grouping. In estimating the mode it is 
frequently helpful to resort to coarser grouping (smaller number of class 
intervals) than usual. This results in larger frequencies within the classes 
and usually larger differences between frequencies, so that there is less doubt 
as to which interval contains the mode. Following a recommendation of 
Kelley,¹ the optimal conditions for estimating the mode prevail when the
numbers of classes are as given in Table 4.5. 


¹ Kelley, T. L. Fundamentals of Statistics. Cambridge, Mass.: Harvard University
Press, 1947. P. 259.

TABLE 4.5. OPTIMAL NUMBERS OF CLASSES FOR ESTIMATING THE MODE FOR
DIFFERENT SIZES OF SAMPLE


The Mode Estimated from the Mean and Median. Fortunately, because 
of certain mathematical relationships between the mode and the other two 
measures of central value, we can estimate the mode from them. A simple 
approximation formula is 


Mo = 3Mdn - 2M (Estimation of a mode from mean and median) (4.8)


In other words, the mode equals three times the median minus two times the
mean.

Applying this formula, we can now estimate the mode of the distribution in 
Table 4.2, in which we were unable to decide upon a crude mode. The 
median for this distribution is 88.5, and the mean is 92.62. Although we 
rounded the mean to one decimal place in reporting it, in further calculations 
with it, we do well to keep the second decimal place. Applying formula (4.8), 
the computed mode equals 


(3 × 88.5) - (2 × 92.62) = 265.5 - 185.24 = 80.26


Rounded to one decimal place, the estimated mode is 80.3. Reference to the
distribution in Table 4.2 again will show that this point comes about midway 
among the four high frequencies. Had we done a very reasonable thing and 
placed the crude mode midway among these four intervals, it would have been 
at 79.5, which is less than one unit from the calculated mode. 
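
The arithmetic of formula (4.8) is easily checked by machine. The following is a minimal sketch; the function name `estimated_mode` is ours, and the median and mean are those quoted above for Table 4.2.

```python
def estimated_mode(median, mean):
    """Formula (4.8): Mo = 3Mdn - 2M."""
    return 3 * median - 2 * mean


mo = estimated_mode(median=88.5, mean=92.62)
print(round(mo, 2))  # 80.26
print(round(mo, 1))  # 80.3
```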

It may add meaning to the computed mode to say that it is the point on the 
measuring scale at which the smoothed distribution curve probably has its 
highest point. 


WHEN TO EMPLOY THE MEAN, MEDIAN, AND MODE


Certain Advantages of the Mean. The arithmetic mean is to be preferred 
whenever possible because of several desirable properties. In the first place,
it is generally the most reliable or accurate of the three measures of central
value. By this we mean that, from sample to sample of the same population,
the mean will ordinarily fluctuate less widely. Another reason is that the
mean is better suited to further arithmetical computations. Deviations of
single cases from the central value are important information about any
distribution. Much is done with these deviations, as will be seen in the
following chapter. It will also be found that we square those deviations, and
this we are really justified in doing only when the deviations are taken from
the mean. When distributions are reasonably symmetrical, we may almost 
always use the mean and should prefer it to the median and mode. On the 
other hand, there are instances, particularly when distributions are skewed 
and when the mean would lead to erroneous ideas about a distribution, in 
which other measures of central value are better used. 

A Comparison of the Mean with Median and Mode. One property of the 
mean is that it is sensitive to the size of extreme measurements when they are 
not balanced by other extreme measurements on the other side of the middle. 
In the following set of measurements, the mean is 9 and the median is 9: 


4, 5, 7, 9, 11, 13, 14 


Now, if the 14 had been 23 instead of 14, the median would be unchanged,
but the mean would become about 10.3. There are still an equal number of cases
above and below 9. So far as the median is concerned, the 11, 13, and 14
could have been 110, 130, and 140, and still the median would be 9. But in
this rather unusual but not impossible event, the mean would become 57.9,
where formerly it was only 9. The conclusion to be drawn is that when, in a 
small sample particularly, there are any very extreme measurements not 
balanced by other extreme measurements in the other direction, the median 
is to be preferred to the mean. 
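
The contrast can be verified directly. The short sketch below uses the seven measurements given above and the two altered versions of them; the variable names are ours, and the functions come from Python's standard `statistics` module.

```python
from statistics import mean, median

original = [4, 5, 7, 9, 11, 13, 14]
one_extreme = [4, 5, 7, 9, 11, 13, 23]          # the 14 replaced by 23
three_extremes = [4, 5, 7, 9, 110, 130, 140]    # 11, 13, 14 replaced

for label, data in [("original", original),
                    ("one extreme", one_extreme),
                    ("three extremes", three_extremes)]:
    # The median depends only on the middle case; the mean on every case.
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
# original: mean = 9.0, median = 9
# one extreme: mean = 10.3, median = 9
# three extremes: mean = 57.9, median = 9
```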

Some Mathematical Properties of the Arithmetic Mean and the Median. A
better appreciation of the nature of the mean and of the median may be
gained by noting some of their mathematical peculiarities. To illustrate, let
us use the data presented in Table 4.6. There six scores are given for six
individuals. The mean of these scores is 6.0 and the median is 4.5.


TABLE 4.6. ILLUSTRATION OF CERTAIN PROPERTIES OF THE ARITHMETIC MEAN AND
THE MEDIAN

  (1)      (2)       (3)           (4)            (5)               (6)
                  Deviations    Deviations     Deviations        Deviations
Person   Score     from the      from the    from the mean,   from the median,
                     mean         median        squared           squared

  A        2         -4           -2.5            16               6.25
  B        3         -3           -1.5             9               2.25
  C        4         -2           -0.5             4               0.25
  D        5         -1           +0.5             1               0.25
  E        9         +3           +4.5             9              20.25
  F       13         +7           +8.5            49              72.25

Sums      36          0           +9.0            88             101.50
Mean     6.0
Median   4.5
The first feature to be pointed out is that the mean is the center of gravity
of the scores. In Fig. 4.2 we have the six scores represented on the measurement
scale. Imagine that the six individuals are arranged in their proper
places along this scale. Imagine that the scale itself is a rigid plank or bar. 
The six persons may be regarded as exactly the same in all respects except for 
their scores on this scale. Each “weighs” the same; his effect upon the tilting
of the bar depends only upon his position upon it. If we wish to rest the
bar upon a single fulcrum in such a position that the bar will be perfectly
balanced, that position must coincide with the mean. The measurements in 
any sample are perfectly balanced about the arithmetic mean. 


FIG. 4.2. Illustration of the positions of six cases with respect to the arithmetic mean and
with respect to the median. If all cases carry equal intrinsic weight, when we take into
account their deviations they are perfectly balanced when the fulcrum is placed at the
arithmetic mean.


Each individual in this small distribution carries an effective weight in pro- 
portion to his distance from the mean. In the parlance of the physicist, each 
person’s distance from the mean is called a moment. In statistics, also, we
often speak of moments in a similar sense. In column 3 of Table 4.6, each of
the six moments for this small distribution is given. They are more com- 
monly called deviations from the mean, or simply deviations. The size of each 
deviation indicates how much effective weight the moment carries, and its 
algebraic sign tells in what direction that weight is applied. The algebraic 
sum of these moments is zero, as it always is when the arithmetic mean and 
the deviations are correctly computed. This is simply another indication 
that the mean is a center of gravity, for the positive and negative moments 
about the mean are perfectly balanced.

The arithmetic mean is the only value in a distribution from which the 
deviations always sum algebraically to zero. To show that the median does 
not qualify in this respect, let us find the deviations of the six scores from the 
median and sum them (see Table 4.6). The algebraic sum of the deviations 
from the median is 9.0. This means a net balance of nine units on the plus 
side. A fulcrum placed at the point 4.5 on the scale would be seriously over- 
balanced toward the end with the high scores. This comes from the fact that 
in computing a median we ignore the distance of each case from the central 
value. If we want the bar to balance when the fulcrum is placed at the 
median value, we shall have to rearrange the cases, treating all cases above
the median as if they had the same value and all cases below the median as if 
they also had the same value and a value as far below the median as the 
above-median group was placed above it. 

Not only are the deviations from the mean balanced about it but they have 
another important property. If we square each deviation, we have the 
squared moments about the mean. The peculiarity of the mean is that the 
sum of the squared deviations about it is smaller than that for the squared
deviations about any other value. In most of the following chapters we shall 
be concerned with squared deviations from the mean. For the present, it is 
merely significant to point out that when squared deviations are considered 
the arithmetic mean is closest to the measurements of the sample as a whole. 


[Figure: two frequency curves. In A the order along the scale is mean, median, mode; in B it is mode, median, mean.]

FIG. 4.3. Two skewed distributions, A skewed negatively and B skewed positively, showing
the relative positions of mode, median, and mean in each distribution. Note that the
mean is displaced farther from the mode toward the skewed end of the distribution and that
the median is displaced two-thirds as far.


In Table 4.6 we can see that for this small sample the sum of squared devia- 
tions is much smaller when the reference point is the mean than when it is the 
median, the two sums being 88 and 101.5. The reader may verify the fact 
that 88 is the smallest possible sum of squared deviations in this sample by 
arbitrarily choosing other values as possible points of central value. 
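
The verification suggested to the reader can also be carried out mechanically. The sketch below uses the six scores of Table 4.6; the helper names and the grid of trial values (0.0 to 15.0 in steps of 0.1) are our own choices for illustration.

```python
from statistics import mean, median

scores = [2, 3, 4, 5, 9, 13]       # Table 4.6
m, mdn = mean(scores), median(scores)


def sum_dev(center):
    """Algebraic sum of deviations from a chosen central value."""
    return sum(x - center for x in scores)


def sum_sq_dev(center):
    """Sum of squared deviations from a chosen central value."""
    return sum((x - center) ** 2 for x in scores)


print(m, mdn)                            # 6 4.5
print(sum_dev(m), sum_dev(mdn))          # 0 9.0
print(sum_sq_dev(m), sum_sq_dev(mdn))    # 88 101.5

# Try other candidate central values: none gives a smaller sum of squares.
candidates = [c / 10 for c in range(0, 151)]
best = min(candidates, key=sum_sq_dev)
print(best)                              # 6.0
```

The search over trial values is a crude check, not a proof; it merely shows that, on this grid, no other reference point yields a sum of squared deviations below 88.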

Central Values in Skewed Distributions. In skewed distributions, the 
mean is always pulled toward the skewed (pointed) end of the curve, as Fig. 
4.3 shows. The arithmetic mean, as the center of gravity of the distribution, 
is weighted toward the extreme values, as was demonstrated above. The sum
of the deviations on the one side of it equals the sum of the deviations on the 
other side. The median comes at a point that divides the area under the 
distribution curve into two equal parts. The number of scores on the one 
side of it equals the number of scores on the other. The interpretations of 
mean and median should be made accordingly. For example, for the data on 
class size in Table 4.3, the median of 18.5 tells us that half of the classes had 
19 or more students enrolled and half of them had 18 or less. The mean class
size, which is 19.1, tells us that if all the enrolled students had been reappor- 
tioned so as to make all classes the same size, the enrollment in each class 
would have been 19.1, or 19, with a few students left over. 


When the Mean Is Misleading. In some instances, to give the mean of a
distribution only is highly misleading; for example, in a study of class size in a 
certain university, among 62 classes, there were two classes having more than 
200 students, and two having between 100 and 200 students, all the remaining 
classes except two being smaller than 60. The average size of the 62 classes 
was 34, but this was not very typical, because half of the classes had 20 or less 
(the median was 20.5). The most typical size of class would be given as the
mode, which was 17 (crude mode). If our purpose happened to be to equalize
the size of classes, assuming that this were practical, we could conclude that 
there would be 34 students per class. If we wanted to decide as a matter of 
educational policy whether or not there were too many small classes in general 
and if we had concluded beforehand that most teachers can successfully 
handle 30 students in a group, then the median would tell us, without knowing 
anything more about the distribution, that there were entirely too many 
small classes. The mean would not have told us this, because it was higher 
than 30. If we were piloting a visiting inspector about the buildings while 
classes were in session and wished to prepare him for the most likely size of 
class he would find at random, we should give him the mode, since this size 
is more likely to occur than any other one size. If we were purchasing equipment
to suit classes of various sizes, we should adapt it, if necessary, most
often to classes of modal size, though in this case we should also want to know 
more about the entire frequency distribution. 

Mean and Median Often Both Reported. In reporting upon central values 
of skewed distributions, it is usually well to state both the mean and the 
median, since each tells its own story, and from the difference between the
two we can immediately infer in what direction the distribution is skewed 
and about how strongly. Although the mode is easily and quickly deter- 
mined and will often serve until better averages can be computed, it should 
probably never be reported alone and need not be reported with the other two 
averages except when it is meaningful to do so. When a distribution is symmetrical
about the mode, the three averages will coincide, and so only one of
them, preferably the mean, need be reported, together with the fact that the 
distribution is symmetrical. 

When the Median Is Especially Called For. There are one or two kinds of 
distribution in which the median is the only satisfactory average. 

Distributions with Indeterminate Values. There are some distributions in 
which some of the extreme values are not accurately determined. We know 
that they lie out beyond a certain point on the scale but we do not know just 
how far. In certain work-limit tests, for example, some subjects would work 
on for unusual lengths of time if permitted to do so. Suppose that all those 
who work on a certain test up to 10 min. are arbitrarily stopped. They are in 
the minority, and so a median can be found. Time spans up to 10 min. may 
be classified, as usual, into chosen class intervals. From 10 min. up, we find 
the laggards grouped together. We do not know just how long they might
have kept working had we let them continue. An arithmetic mean cannot be 
determined here, but median and mode can still be utilized. 

A Summary of When to Use the Three Averages. In brief, the following 
rules will generally apply: 


1. Compute the arithmetic mean when 

a. The greatest reliability is wanted. It usually varies less from sample 
to sample drawn from the same population. 

b. Other computations, as finding measures of variability, are to follow. 

c. The distribution is symmetrical about the center, particularly when 
it is approximately normal. 

d. We wish to know the “center of gravity” of a sample. 

2. Compute the median when 

a. There is not sufficient time to compute a mean. 

b. Distributions are badly skewed. This includes the case in which one 
or more extreme measurements are at one side of the distribution. 

c. We are interested in whether cases fall within the upper or lower 
halves of the distribution and not particularly in how far from the 
central point. 

d. An incomplete distribution is given. 

3. Compute the mode when 

a. The quickest estimate of central value is wanted. 

b. A rough estimate of central value will do. 

c. We wish to know what is the most typical case. 


MEANS IN SOME SPECIAL SITUATIONS


The measures of central value described thus far will take care of the great 
majority of situations in which such statistics must be computed. There are 
some problems, which, though rare, require other treatment. Four of these 
will be briefly mentioned: means of arithmetic means, means of percentages 
(and proportions), geometric means, and harmonic means. 

Finding Means of Arithmetic Means. When one has the means of several 
samples, presumably from the same population, on the same test or scale, he 
may want to know the over-all mean for the samples combined. At first 
thought, it might seem appropriate simply to average the several means just 
as one would average single observations. This would be proper procedure 
provided the samples are of the same size. If the N’s in the samples differ, 
however, the means are not equally reliable. 

In order to extract the best information about the central value of the 
entire sample, we should weight each mean according to the number of cases 
in the sample from which it was derived, for a mean’s reliability is in propor- 
tion to the size of sample. This procedure is equivalent to pooling all the 
single measurements from the different samples and computing a single over-
all mean. We can accomplish the same end by computing a weighted mean 
of the means, which we already know. The general formula for computing 
a weighted mean is 


$_wM = \dfrac{\Sigma WX}{\Sigma W}$ (A weighted arithmetic mean) (4.9)


where $_wM$ = weighted mean
W = weight
$\Sigma WX$ = sum of the values being averaged, each multiplied by its appropriate weight
$\Sigma W$ = sum of the weights

Table 4.7 illustrates the application of this formula. In the problem repre- 
sented there, four means differing considerably had been derived from sam- 
ples ranging from approximately 400 to approximately 2,700 cases each.¹


TABLE 4.7. COMPUTATION OF A MEAN OF ARITHMETIC MEANS, WITH AND WITHOUT
WEIGHTING THE SAMPLES*

 (1)           (2)                 (3)                  (4)
          Number in the        Mean of the
Group        sample              sample            Weighted mean
           N_i = W              M_i = X            N_i M_i = WX

               15                 25.6                  384.0
               27                 31.3                  845.1
                9                 38.7                  348.3
                4                 32.5                  130.0

Sums        55 = ΣW           128.1 = ΣX          1,707.4 = ΣWX
Means                          32.0 = M             31.0 = $_wM$


* The samples were of scores on a perceptual-speed test administered to aviation students and other
military personnel. The sizes of samples were approximately 100 times the values given. Rounding
was done to simplify the illustration. It probably did not affect the size of the weighted mean materially.
N_i is the number of cases in sample i, and M_i the mean of sample i.


The unweighted mean of these four means would be 32.0, whereas the 
weighted mean is 31.0. The latter is much more representative of all the
individuals in the combined sample. 
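
The computation in Table 4.7 may be sketched as follows, using the four sample sizes and means from that table. The function name `weighted_mean` is ours; the body is simply formula (4.9).

```python
def weighted_mean(values, weights):
    """Formula (4.9): sum of WX divided by sum of W."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


ns = [15, 27, 9, 4]                  # N_i = W (Table 4.7, column 2)
means = [25.6, 31.3, 38.7, 32.5]     # M_i = X (Table 4.7, column 3)

unweighted = sum(means) / len(means)
weighted = weighted_mean(means, ns)

print(round(unweighted, 1))  # 32.0
print(round(weighted, 1))    # 31.0
```

With equal weights the function reduces to the ordinary mean, which is why the unweighted figure is acceptable only when the N's are the same.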

When the means to be averaged are very close together, as they will ordinarily
be when samples are drawn from the same population and are not too
small, and when the N’s do not vary much from sample to sample, the
weighted and unweighted means will be very close together. In certain
situations, then, the unweighted mean may be reported. But if the composite
mean is to be used for further computations, in which case it should
often be estimated to the second decimal place, weighting certainly is called
for.

¹ The means are so different and samples are so large that it is highly unlikely that the
samples came from the same population. They will serve to illustrate the procedure
nevertheless.

The Mean of Percentages or of Proportions. The weighting procedure 
just described is even more important in determining the mean of a series of 
percentages or of proportions. Table 4.8 illustrates this point. The data 
in that table have to do with the percentage of pilot students eliminated in 
certain schools during one training period. Had the schools had the same 
enrollment, or even very nearly the same, the unweighted mean would 
suffice. Since the largest class is nearly four times as great as the smallest, 
however, and since elimination rates vary from 3.3 to 27.2, there is a marked 
difference between weighted and unweighted means. If we wished to know 
the over-all elimination rate in order to make decisions for some administrative
purpose, the unweighted mean would be misleading. Certainly, when
the percentage or the proportion in a composite is wanted for further computations,
the weighting procedure is essential, unless the sample N's are exactly
equal.


TABLE 4.8. COMPUTATION OF AN AVERAGE PERCENTAGE*

               N_i             N_i P_i / 100            P_i

Sums       688 = ΣN_i      141 = ΣN_i P_i / 100     86.1 = ΣP_i
Means      137.6 = M_N                              17.2 = M_P†

* The data represent students enrolled in five AAF pilot schools selected to illustrate this procedure.
† The weighted mean of the percentages equals 14,100/688 = 20.5. The value 17.2 is the unweighted
mean.

In terms of a formula, the weighted mean of a percentage is 


$_wM_P = \dfrac{\Sigma N_iP_i}{\Sigma N_i}$ (Mean of percentages where N's differ) (4.10)


where $N_i$ = number in each sample
$P_i$ = percentage for each sample
$\Sigma N_iP_i$ = sum of products of each percentage times its corresponding N
$\Sigma N_i$ = sum of the sample N's




A completely analogous formula applies to finding the weighted mean of proportions,
in which case p is substituted for P.
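
That weighting by N amounts to pooling the cases can be shown with a small sketch. The enrollments and eliminations below are hypothetical figures chosen for easy arithmetic (they are not the data of Table 4.8), and the variable names are ours. The last line merely rechecks the division quoted in the footnote to Table 4.8.

```python
# Hypothetical schools: number enrolled and number eliminated in each.
enrolled = [200, 120, 80]
eliminated = [10, 30, 20]

# Percentage eliminated in each school.
percents = [100 * e / n for e, n in zip(eliminated, enrolled)]

unweighted = sum(percents) / len(percents)
# Formula (4.10): sum of N_i * P_i divided by sum of N_i.
weighted = sum(n * p for n, p in zip(enrolled, percents)) / sum(enrolled)
# Pooling all students into one group gives the same over-all rate.
pooled = 100 * sum(eliminated) / sum(enrolled)

print(percents)               # [5.0, 25.0, 25.0]
print(round(unweighted, 1))   # 18.3
print(weighted, pooled)       # 15.0 15.0

# Check of the footnote to Table 4.8.
print(round(14_100 / 688, 1))  # 20.5
```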

The Geometric Mean. The arithmetic mean of two numbers is found by 
adding them and dividing by two. The geometric mean of two numbers is 
found by multiplying the two numbers and then taking the square root. The
arithmetic mean of 2 and 18 is 10.0. The geometric mean is


$\sqrt{2 \times 18} = \sqrt{36} = 6.0$


The geometric mean of three numbers is the cube root of their product; of 
four numbers, the fourth root of their product; and so on. In terms of a
general formula,


$GM = \sqrt[N]{X_1 \times X_2 \times X_3 \times \cdots \times X_N}$ (4.11)


where GM = geometric mean
$X_1, X_2, \ldots, X_N$ = series of measurements
N = number of measurements

When there are more than two measurements to be averaged in this manner,
the computations become bothersome unless we resort to the use of logarithms.
Students of mathematics will recognize that if we take logarithms
of both sides of formula (4.11) we obtain the equation


$\log GM = \dfrac{\Sigma \log X}{N}$ (Logarithmic solution of geometric mean) (4.12)


In other words, the steps called for are as follows:


Step 1. Convert each X into a corresponding log X, by using Table K, 
Appendix B. 

Step 2. Sum the log X values. 

Step 3. Divide this sum by N. This result is the logarithm of the geometric
mean, as shown by formula (4.12). 

Step 4. Find the antilogarithm of the value obtained in step 3. This is the 
geometric mean.


These steps are illustrated in Table 4.9. 

One of the instances in which the geometric mean applies in psychology is 
in the averaging of stimulus values in psychophysics, when those stimulus 
values are used to indicate psychological quantities rather than physical 
quantities. The data in Table 4.9 are fictitious and were invented to illustrate
a point. Let us suppose that an observer with very poor discriminative
power were asked to control a sound-generating instrument so as to produce a 
sound matching in loudness a tone that he has just previously heard. On 
five different trials the readings of his settings might be as given in column 2
of Table 4.9. We want to find his average setting. 

The arithmetic mean, as shown in column 2, would be 12.2 units. According
to what we know about psychophysical relationships this would be incorrect.
We are really interested in the mean of his sensory responses, the loudness
of the tones that he hears. We assume these to lie on a psychological
scale whereas the stimuli lie on a scale of physical energy. Let a value on the
psychological scale be called R and one on the physical scale be called S.
From Fechner’s psychophysical law, the relationship of R to S is usually
stated in the equation R = C(log S). Strictly speaking, the S values should
be expressed as multiples of the stimulus limen, but that need not concern us
particularly here. We may assume that the S values in column 2 are multiples
of the threshold stimulus.


TABLE 4.9. COMPUTATION OF A GEOMETRIC MEAN OF TONES MATCHED FOR LOUDNESS
TO A STANDARD TONE

 (1)         (2)               (3)
Trial     Stimulus      Logarithm of the
            (S)         stimulus (log S)

  1          14              1.1461
  2           8              0.9031
  3          22              1.3424
  4           7              0.8451
  5          10              1.0000

Sums         61              5.2367
Means      12.2              1.0473

Geometric mean (antilog of 1.0473) = 11.2

In this connection the reader may be reminded of the decibel scale for loudness
of sounds. The decibel-scale values are proportional to the logarithms
of the stimuli. Ten decibels represent a stimulus 10 times as strong physi- 
cally as the threshold stimulus; 20 decibels one 100 times as strong; 30 decibels 
1,000 times, and so on. The physical values increase in a geometric series 
while the psychological values are assumed to progress in a parallel arithmeti- 
cal series. 

To return to Table 4.9, the logarithms of S are found in column 3. Their 
sum is 5.2367, and their mean is 1.0473. The antilogarithm of this value is 
11.2, which is the geometric mean. It will be seen that this value is 1.0 unit 
smaller than the arithmetic mean of the same stimulus values. We would 
conclude that for this observer the stimulus that for him seems most equiva- 
lent to the standard sound is one of 11.2 units. 
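
The four steps of the logarithmic solution can be followed in a short sketch, with Python's `math.log10` taking the place of the table of logarithms. The data are the five settings of Table 4.9; the variable names are ours. The final lines recompute the result directly from formula (4.11) as a cross-check.

```python
import math

settings = [14, 8, 22, 7, 10]                 # Table 4.9, column 2

logs = [math.log10(s) for s in settings]      # Step 1: log X for each X
total = sum(logs)                             # Step 2: sum of the logs
mean_log = total / len(logs)                  # Step 3: divide by N
gm = 10 ** mean_log                           # Step 4: antilogarithm

print(round(total, 4), round(mean_log, 4))    # 5.2367 1.0473
print(round(gm, 1))                           # 11.2
print(round(sum(settings) / len(settings), 1))  # 12.2 (arithmetic mean)

# Cross-check with formula (4.11): the Nth root of the product.
direct = math.prod(settings) ** (1 / len(settings))
print(abs(gm - direct) < 1e-9)                # True
```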
When to Use the Geometric Mean. Probably the most common use of the
geometric mean in psychology has already been illustrated, namely, in psychophysics.¹
There are other places in which it may well be preferred, for example,
in many instances in which time measurements are used, including
reaction-time measurements. The need for a geometric mean may be indicated
when distributions are distinctly positively skewed. It is best, however, to 
look for some rational basis, such as the existence of geometric series, before 
deciding to compute this kind of mean. A rate-of-growth measurement, for 
example, often involves a geometric series. An important limitation is that 
a geometric mean cannot be computed when any measurement in the dis- 
tribution is zero or negative. 

Harmonic Mean. Like the geometric mean, the harmonic mean is needed 
because the measurements were not made on an appropriate scale. A com- 
mon application for it is in connection with “work-limit” tests. In such 
tests the score is the amount of time required to complete a fixed quantity of 
work. The frequency distribution of such scores is often positively skewed. 
Such tests, if given in the more usual form of “time-limit” tests, would yield
scores in terms of units of work accomplished in a fixed time. The frequency 
distributions of such scores more commonly approach symmetry. If the 
ability or abilities measured are assumed to be normally, or at least symmetri- 
cally, distributed in the population from which the sample came, it is reason- 
able that the time-limit score is more representative than the work-limit 
score, representative in the sense that it spaces individuals better along a
scale of equal units of ability.

The harmonic mean (HM) is defined as the reciprocal of the mean of the 
reciprocals of the measurements. The formula is 


$\dfrac{1}{HM} = \dfrac{1}{N}\,\Sigma\!\left(\dfrac{1}{X}\right)$ (Equation defining a harmonic mean) (4.13)


A formula for computing the HM is 


$HM = \dfrac{N}{\Sigma \dfrac{1}{X}}$ (Computing formula for the harmonic mean) (4.14)


As in the case of the geometric mean, the harmonic mean cannot be com- 
puted when any X is zero or negative. 
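
Formula (4.14) may be sketched as below. The function name `harmonic_mean_414` is ours; the scores are the four work-limit scores of Exercise 10, used here only as convenient figures. The standard library's `statistics.harmonic_mean` is called as an independent check.

```python
from statistics import harmonic_mean


def harmonic_mean_414(xs):
    """Formula (4.14): N divided by the sum of the reciprocals."""
    if any(x <= 0 for x in xs):
        raise ValueError("harmonic mean requires all X greater than zero")
    return len(xs) / sum(1 / x for x in xs)


times = [20, 25, 40, 50]            # work-limit scores, in seconds
hm = harmonic_mean_414(times)

print(round(hm, 1))                           # 29.6
print(round(sum(times) / len(times), 2))      # 33.75 (arithmetic mean)
print(abs(hm - harmonic_mean(times)) < 1e-9)  # True
```

Averaging the reciprocals is the same as averaging rates of work, which is why the result falls below the arithmetic mean of the times.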


Exercises 


1. Compute the arithmetic mean of any or all distributions in Data 4A to 4F inclusive,
using the method that seems most feasible. In Data 4E, you will need to make some 
assumption about the cases in the two highest intervals. State your assumptions if means 
are computed for these distributions. 


¹ See Guilford, J. P. Psychometric Methods. 2d ed. New York: McGraw-Hill, 1954.

Data 4A. SCORES IN AN ENGLISH-USAGE EXAMINATION

Scores     f
52-53      1
50-51      0
48-49      5
46-47     10
44-45      9
42-43     14
40-41      7
38-39      8
36-37      6
34-35      5
32-33      3
Sum       68


Data 4B. AFFECTIVITY SCORES
(Per cent of 400 words marked pleasant)

Scores     f
95-99      6
90-94     11
85-89     16
80-84      7
75-79      9
70-74      8
65-69      2
60-64      3
55-59      2
50-54      1
Sum       65

Data 4C. SCORES MADE BY GRADUATES AND ELIMINEES IN THE COMPLEX COORDINATION
TEST BY STUDENT PILOTS

                 Frequencies
Scores     Graduates    Eliminees
95-99           1
90-94           1
85-89           7            1
80-84          13            2
75-79          37            6
70-74          75           23
65-69         189           34
60-64         297           94
55-59         406          144
50-54         425          208
45-49         341          209
40-44         174          205
35-39          81          105
30-34          16           34
25-29           5           15
20-24           0            2
15-19           1


Data 4D. SCORES IN AN ADJUSTMENT INVENTORY OBTAINED FROM ALCOHOLICS
AND NONALCOHOLICS OF BOTH SEXES*

                          Frequencies
                 Males                     Females
Scores   Alcoholics  Nonalcoholics   Alcoholics  Nonalcoholics
66-71 1 
60-65 6 3 
54-59 13 1 2 1 
48-53 13 1 10 2 
42-47 17 3 11 1 
36-41 33 3 12 1 
30-35 32 2 8 8 
24-29 32 9 11 17 
18-23 23 16 5 26 
12-17 24 36 2 40 
6-11 7 43 2 49 
0-5 1 25 21 


* Manson, M. P. A psychometric differentiation between alcoholics and nonalcoholics.
Quar. J. Stud. Alcohol., 1948, 9, 175-206.

Data 4E. AGES OF COLLEGE FRESHMEN

Age at last birthday


Data 4F. AIMING-TEST SCORES
(In terms of average error in millimeters)

Score        Men    Women
8.0-8.4        1
7.5-7.9        5
7.0-7.4        2
6.5-6.9        7       2
6.0-6.4        6       4
5.5-5.9       11       3
5.0-5.4       10       9
4.5-4.9       16       7
4.0-4.4       18      15
3.5-3.9       19      12
3.0-3.4       17      15
2.5-2.9       17      13
2.0-2.4       14      14
1.5-1.9       13      10
1.0-1.4        8       1
0.5-0.9        1
Sums         165     105


2. Compute medians for any or all distributions in Data 4A to 4F inclusive. Why 
is the difficulty experienced with computation of the mean in Data 4E not also encoun- 
tered in computing the median? 

3. Give the crude modes for all distributions in Data 4A to 4F. Compute the estimated
mode in distributions for which you know both mean and median. 

4. Compute and list the means, medians, and crude modes (where possible) for the 
distributions in Data 4G. 


Data 4G. SOME UNGROUPED DATA
a. 8, 15, 13, 6, 10, 16, 7, 12, 11, 14, 9
b. 12, 10, 18, 13, 4, 8, 17, 15, 6, 14
c. 9, 8, 9, 15, 3, 9, 11, 9, 13
d. 12, 28, 19, 15, 15, 35, 14, 15
e. 7, 18, 20, 14, 27, 23, 13, 3


5. For each distribution in Data 4G, tell to which measure of central value you give first 
preference and to which, second. Give reasons, 

6. For each distribution in Data 4A to 4F inclusive, tell which measure of central value
you would prefer and which would be your second choice. Give reasons. 

7. Find the weighted means of the four means: 15, 16, 18, and 21. These means were 
derived from samples in which the N's were 6, 10, 25, and 20, respectively. Compute 
the unweighted arithmetic mean of the four, for comparison. Interpret your result. 

8. Find the weighted mean of the Proportions .25, .30, .32, and .33. These propor- 
tions were based upon samples whose N’s are 44, 32, 18, and 25, respectively. Compute 
an unweighted arithmetic mean of these proportions, for comparison. Interpret your
results.

9. Find the geometric mean of the numbers 2, 9, 15, and 16. Compute the arithmetic
mean, for comparison. Interpret your results.

10. Find the harmonic mean of the work-limit scores 20, 25, 40, and 50. These scores
represent the total time summated in a series of 120 simple reaction times and are in terms
of seconds. Interpret your results.


Answers 
1, 2, and 3: 


Data | 4A 4B 4C 4D 4E 4F


7. 18.4; 17.5.
8. .291; .300. 
9. 8.1; 10.5. 
10. 29.6. 


CHAPTER 5 


MEASURES OF VARIABILITY 


Knowing the central value of a set of measurements tells us much, but it 
does not by any means give us the total picture of the sample we have measured.
Two groups of six-year-old children may have the same average IQ of
105, from which we would conclude that, taken as a whole, each group is as
bright as the other, and we might expect from the two the same average level
of performance in school or out of school in areas of life where IQ is important.

Yet when we are told, in addition, that one group has no individuals with
IQ's below 95 or above 115, whereas the other has individuals with IQ's
ranging from 75 to 135, we recognize immediately that there is a decided
difference between the two groups in variability or dispersion of brightness.
The first group is decidedly more homogeneous with respect to IQ, and the
second is decidedly more heterogeneous. We should expect the first group 
to be much more teachable in that they will grasp new ideas at about the 
same rate and progress at about the same rate. We should expect the second 
group to show considerable disparity in speed of grasping new ideas. There 
will be extreme laggards at the one end of the distribution and others at the 
other end of the distribution who may be irked at the slow progress of the 
group. The distributions for two such groups, when plotted, resemble those 
in Fig. 5.1. 

It is the purpose of this chapter to explain and illustrate the methods of 
indicating degree of variability or dispersion by the use of single numbers, 
just as in the preceding chapter we saw how the central value of a distribution 
could be indicated by a single number. The four most customary values to
indicate variability are (1) the total range, (2) the semi-interquartile range
Q, (3) the standard deviation σ, and (4) the average (or mean) deviation AD.¹


THE TOTAL RANGE


The total range is the indicator of variability that is easiest and most
quickly ascertained but is also the most unreliable; thus it is almost entirely
limited to the purpose of preliminary inspection. In the illustration of the
preceding paragraph, the range of the first group (from an IQ of 95 to an IQ
of 115) was 21 IQ points inclusive. The range of the second group was from
75 to 135 IQ points, or 61 points inclusive. The range is the distance given by
highest score minus lowest score, plus 1. From this comparison, we draw the
conclusion that the second group is considerably more variable than the first.

1 The probable error PE has been used as a measure of variability, but it has almost
entirely gone out of use.

Why the Range Is Unreliable. The range is very unreliable for the reason 
that only two measurements are used to determine it. The remaining meas- 
urements have nothing to do with the estimation of it. In the second group 
just mentioned, it might have been true that there were several IQ's of 75 and
also several IQ's of 135; but this would be most unusual. The chances are


Fig. 5.1. Two distributions with the same mean (IQ = 105) but with decidedly different
ranges (and dispersions). [Figure not reproduced; the horizontal axis is IQ, marked from
75 to 135.]


great that there would be only one 75 and one 135. Furthermore, the next 
lowest JQ might have been 85, with a gap of 10 points to the very lowest; and 
the next to the highest might have been 120, a distance of 15 points from the 
very highest. Had either or both of the persons with 75 IQ and 135 IQ been
missing from the group, the range would have been something very different 
from the 61 points actually obtained. This is what we mean by saying that 
the total range is highly unreliable. Some faith can, of course, be placed in 
it when there is more than one case having each of the extreme measurements 
and when there are no decided gaps in the tails of the distribution. 
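
The dependence of the total range on its two extreme cases can be seen in a small sketch. The list of IQ's below is hypothetical; it is built only to match the situation described above (single extremes at 75 and 135, the next lowest at 85, the next highest at 120), and the function name `total_range` is an assumption of this example.

```python
def total_range(scores):
    """Inclusive total range: highest score minus lowest score, plus 1."""
    return max(scores) - min(scores) + 1

# Hypothetical group: one IQ of 75, one of 135, with gaps in both tails.
iqs = [75, 85, 92, 98, 101, 105, 105, 109, 112, 118, 120, 135]

full_range = total_range(iqs)                    # both extremes present
trimmed_range = total_range(sorted(iqs)[1:-1])   # both extremes missing

print(full_range, trimmed_range)   # 61 36
```

Dropping just two of the twelve cases cuts the range from 61 points to 36, although the bulk of the distribution is unchanged.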

When Ranges Should Not Be Compared. Total ranges should not be 
compared when two distributions have a markedly different number of cases. 
It is quite natural for more extreme cases to show up as we add new cases to 
any sample, so that larger groups should be expected to have wider total 
scatter. This factor is not nearly so important for other indicators of dis- 
persion as it is for total range. Another caution almost goes without saying, 
and that is the impossibility of comparing ranges in two distributions where 
the units of measurement are not the same.


THE SEMI-INTERQUARTILE RANGE, Q


The semi-interquartile range, Q, is one-half the range of the middle 50 per 
cent of the cases. First we find by interpolation the range of the middle 50 
per cent, or interquartile range, then divide this range by 2. See Fig. 5.2 for
a general picture of the relation of Q to a frequency distribution. 

Quartiles and Quarters. When we count up from below to include the 
lowest, or first, quarter of the cases, we find the point called the first quartile, 
which is given the symbol Q1. Counting down from above to include the
highest, or fourth, quarter of the cases, we locate the third quartile, or Q3.

Fig. 5.2. Illustration of the quartiles Q1, Q2, and Q3, the interquartile and
semi-interquartile ranges, and the quarters of the sample in a slightly skewed
distribution. [Figure not reproduced; lines are erected at the first quartile (Q1),
the second quartile (Q2, also the median), and the third quartile (Q3).]


Incidentally, the median, which separates the second and third quarters of
the distribution, is also called Q2. Note that the quartiles Q1, Q2, and Q3 are
points on the measuring scale. They are division points between the quarters.
We may say of an individual that he is in the highest quarter (or fourth
quarter), and we may say of another that he is at the third quartile. We should
never say of an individual that he is in a certain quartile.

Interpolation of Q1 and Q3. In the distribution of ink-blot scores again,
we locate the third and first quartiles by interpolation (see Table 5.1).
One-fourth of the cases (N/4) is 12.5. Counting up from the bottom to
include 12.5 cases, we find that we need 2.5 out of the 6 cases in the third class
interval. As in earlier solutions, 2.5/6 times 5 gives 2.08. Added to 19.5,
this gives 21.58 as the position of Q1. Counting down from the top, we find
that we need 3.5 cases out of 6 in the fifth class interval. Then 3.5/6 of 5
gives 2.92. Deducted from 39.5, this leaves 36.58 as our estimate of Q3.

The Interquartile Range and Q. The interquartile range, or the distance
from Q1 to Q3, is given by Q3 - Q1, or 36.58 - 21.58, which equals 15.00.




TABLE 5.1. DETERMINATION OF Q3, Q1, AND Q (THE SEMI-INTERQUARTILE RANGE)
FOR THE INK-BLOT-TEST SCORES


Scores     f
55-59      1
50-54      1
45-49      3
40-44      4
35-39      6   (Q3 lies within this interval)
30-34      7
25-29     12
20-24      6   (Q1 lies within this interval)
15-19      8
10-14      2
       N = 50

$Q_1 = 19.5 + \dfrac{2.5}{6} \times 5 = 19.5 + 2.08 = 21.58$

$Q_3 = 39.5 - \dfrac{3.5}{6} \times 5 = 39.5 - 2.92 = 36.58$

$Q = \dfrac{36.58 - 21.58}{2} = \dfrac{15.00}{2} = 7.5$


The semi-interquartile range is one-half of this, or 7.5. In terms of a formula, 


$Q = \dfrac{Q_3 - Q_1}{2}$    (Semi-interquartile range) (5.1)


where Q3 = third quartile and Q1 = first quartile.
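
The interpolation of Q1 and Q3 and the use of formula (5.1) can be sketched in code. This is a minimal illustration, assuming that cases are spread evenly within each class interval; the function name `grouped_quantile` and the list layout are choices made for this example only. It counts up from the bottom for both quartiles, which gives the same Q3 as counting down from the top.

```python
def grouped_quantile(lower_limits, freqs, width, p):
    """Score below which a fraction p of the cases fall, found by linear
    interpolation inside the class interval that contains that point."""
    target = p * sum(freqs)      # number of cases to count off from below
    below = 0                    # cases in the intervals already passed
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= target:
            return lower + (target - below) / f * width
        below += f
    raise ValueError("p must lie between 0 and 1")

# Ink-blot scores of Table 5.1, listed from the lowest interval upward,
# with the exact lower limit of each interval.
lower_limits = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5]
freqs = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]

q1 = grouped_quantile(lower_limits, freqs, 5, 0.25)
q3 = grouped_quantile(lower_limits, freqs, 5, 0.75)
q = (q3 - q1) / 2                # formula (5.1)

print(round(q1, 2), round(q3, 2), round(q, 2))   # 21.58 36.58 7.5
```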

How Quartiles Indicate Skewness. It is of interest in passing to take note
of the relative distances of Q1 and Q3 from the median, or Q2, in a distribution.
If the distribution is exactly symmetrical, both the third and first quartiles
will be the same distance from the median, and that distance is Q. When
there is any skewness in the distribution, the two distances will be unequal.
If the skewness is positive, the distance Q3 - Q2 will be greater than the
distance Q2 - Q1. If the skewness is negative, the reverse will be true. In
other words, skewness is:


positive when (Q3 - Q2) > (Q2 - Q1)
negative when (Q3 - Q2) < (Q2 - Q1)
and zero when (Q3 - Q2) = (Q2 - Q1)


The relative sizes of these two distances therefore tell much about the
direction and the amount of skewness in the distribution. For the ink-blot scores,
Q3 - Q2 is 8.4, and Q2 - Q1 is 6.6. Our inference is that the distribution is
positively skewed to a moderate degree. In Fig. 5.2 the distribution is
positively skewed and (Q3 - Q2) is greater than (Q2 - Q1).
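
The three-way rule can be written as a small function. This is only a sketch; the name `quartile_skew` is an assumption, and the median of 28.18 used in the first call is the value implied by the distances 8.4 and 6.6 quoted above rather than a figure taken directly from the text.

```python
def quartile_skew(q1, q2, q3):
    """Direction of skewness judged from the quartile distances."""
    upper = q3 - q2      # distance from the median up to Q3
    lower = q2 - q1      # distance from Q1 up to the median
    if upper > lower:
        return "positive"
    if upper < lower:
        return "negative"
    return "zero"

print(quartile_skew(21.58, 28.18, 36.58))   # positive
print(quartile_skew(10, 25, 30))            # negative
print(quartile_skew(10, 20, 30))            # zero
```

With real data the two distances will seldom be exactly equal, so in practice one judges whether the difference is large enough to matter rather than testing for strict equality.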
THE AVERAGE DEVIATION


The average deviation, or AD, is the arithmetic mean of all the deviations when 
we disregard the algebraic signs. Every score or measurement in a distribution 
deviates from the mean in that it is a certain distance above or below the 
mean. When and if any measurement coincides exactly with the mean, its 
deviation is zero. Deviations above the mean are regarded as positive dis- 
tances, those below the mean as negative distances. In terms of an algebraic 
definition, 


$x = X - M$    (A deviation of a measurement from the mean) (5.2)


where X = an original score or measurement and M = the arithmetic mean.

As was pointed out in a previous chapter, the deviations from the mean 
may be regarded as moments about a center of gravity. If we sum the
deviations, taking into account the algebraic signs, the sum would be zero. In
other words, Σx = 0. The average of the deviations would also be zero,
because Σx/N = 0/N, and zero divided by any finite number is equal to zero.
This kind of average of the deviations tells us nothing, therefore, about their
size. We want some indication of their over-all size in order to describe the
amount of dispersion. The greater the spread of the deviations, the greater
the dispersion of the distribution.

One solution is to disregard the algebraic signs of the deviations. In doing
so, we disregard their direction; we are interested only in their amount. We
treat them as if they were all positive. In terms of a formula,

$AD = \dfrac{\Sigma |x|}{N}$    (The average deviation) (5.3)


where |x| (with the vertical bars embracing it) = an absolute value of x, i.e., 
disregarding algebraic sign. 

To illustrate the solution of an average deviation, consider Table 5.2. The 
sum of the absolute deviations is 18.8. Divided by N, this gives 1.88 as the 
average deviation. Because of the small size of N, we should round to one 
decimal place and give the AD as 1.9. 
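
Formula (5.3) can be checked on the ten threshold measurements of Table 5.2, read here as four 11's, three 13's, one 15, and two 17's. The function name `average_deviation` is an assumption of this sketch.

```python
from statistics import fmean

def average_deviation(scores):
    """Mean of the absolute deviations from the arithmetic mean."""
    m = fmean(scores)
    return fmean([abs(x - m) for x in scores])

# Ten observations of the auditory limen, in cycles per second.
scores = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]

mean = fmean(scores)
signed_sum = sum(x - mean for x in scores)   # about zero: signs cancel
ad = average_deviation(scores)

print(round(mean, 1), round(ad, 2))   # 13.2 1.88
```

The signed deviations cancel, which is why the absolute values are needed before averaging.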

Interpretation of an Average Deviation. From the formula and the compu- 
tations it will be seen that when we compute the average deviation we are 
interested merely in the size of the deviations from the mean. We ignore 
their direction. The AD is an arithmetic mean of all the deviations of what- 
ever size or direction. Like any arithmetic mean, it stands for all the values 
averaged. In the problem just solved, the AD tells how much, on the aver- 
age, the different observations of the auditory limen differed from their mean, 
13.2. The answer is that, on the average, these deviations were 1.9 cycles, or 
a little less than 2.

TABLE 5.2. CALCULATION OF THE AVERAGE DEVIATION IN UNGROUPED DATA


(Mean = 13.2)

 X      |x|
13      0.2
17      3.8
15      1.8
11      2.2
13      0.2
11      2.2
17      3.8
13      0.2
11      2.2
11      2.2
       ----
Σ|x| = 18.8

$AD = \dfrac{18.8}{10} = 1.88$, or 1.9


In samples that are not too small and when distributions approach the 
normal bell-shaped form, we may make the further remark that about 58 per 
cent of the observations should be expected to fall within the limits 1 AD
below the mean and 1 AD above the mean. In the threshold problem those
two conditions are not satisfied; the distribution is neither large enough nor
symmetrical enough to warrant such a conclusion. If this were the case,
however, we could say that 58 per cent of the 10 measurements (six of them)
should be expected between 13.2 - 1.9 = 11.3 and 13.2 + 1.9 = 15.1. This
would include all integral values of 12, 13, 14, and 15. Actually, only four 
of the observations were included within those limits, though this should not 
surprise us, in view of the smallness of the sample. 
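
The figure of about 58 per cent can be checked against the normal curve itself. The sketch below relies on a standard property not derived in this chapter, namely that in a normal distribution the average deviation equals the standard deviation multiplied by the square root of 2/π (about 0.798); the variable names are illustrative.

```python
from math import pi, sqrt
from statistics import NormalDist

# In a normal distribution AD = sigma * sqrt(2 / pi), so the limits
# mean - AD and mean + AD lie at about -0.798 and +0.798 sigma.
ad_in_sigma_units = sqrt(2 / pi)

z = NormalDist()   # unit normal: mean 0, standard deviation 1
within_one_ad = z.cdf(ad_in_sigma_units) - z.cdf(-ad_in_sigma_units)

print(round(ad_in_sigma_units, 3), round(100 * within_one_ad, 1))
```

The proportion comes out a little under 58 per cent, in agreement with the rounded figure used above.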

Computation of the AD from Grouped Data. Although the average devia- 
tion is not often computed for large, regular samples in ordinary statistical 
practice, it is probably worth demonstrating how this statistic can be con- 
veniently computed from data grouped in class intervals. Table 5.3 demon- 
strates this kind of solution. The mean of the 50 ink-blot-test scores repre- 
sented in Table 5.3 was previously reported as 29.60. Ordinarily, one decimal 
place (or one digit beyond the last at the right in the original measurements) 
will do in the computation of the AD. 

Column 2 of Table 5.3 presents the midpoints of the intervals. The
midpoint value represents every measurement in the interval. Column 3 gives
the deviations of these midpoints from the computed mean. Algebraic signs
are recorded for the sake of accuracy, but they will not be needed in the
computations. In column 5 are the products of each frequency times its
corresponding deviation, in other words, each fx product. The equation for the
AD by this procedure is

$AD = \dfrac{\Sigma |fx|}{N}$    (The average deviation from grouped data) (5.4)


where f, x, and N are as previously defined, and the fx products are summed
without regard to algebraic sign. From the data in Table 5.3,


$AD = \dfrac{425.6}{50} = 8.512$


which should be rounded to 8.5.¹
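
The grouped-data computation of formula (5.4) can be reproduced as follows. This is a sketch using the midpoints and frequencies of Table 5.3; the variable names are assumptions of the example.

```python
# Midpoints and frequencies of the ink-blot scores, highest interval first.
midpoints = [57, 52, 47, 42, 37, 32, 27, 22, 17, 12]
freqs = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]

n = sum(freqs)
mean = sum(f * x for f, x in zip(freqs, midpoints)) / n

# fx products: frequency times deviation of the midpoint from the mean.
fx = [f * (x - mean) for f, x in zip(freqs, midpoints)]

signed_total = sum(fx)                    # algebraic sum, close to zero
ad = sum(abs(p) for p in fx) / n          # formula (5.4)

print(n, round(mean, 2), round(ad, 3))    # 50 29.6 8.512
```

The algebraic sum of the fx products serves as a check on the arithmetic: it should vanish apart from rounding error.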
According to the kind of interpretation given previously, we may say that, 
if this distribution of scores is close to normal, we should expect 58 per cent 


TABLE 5.3. COMPUTATION OF AN AVERAGE DEVIATION IN GROUPED DATA


 (1)      (2)      (3)      (4)      (5)
Scores    Xc        x        f        fx
55-59     57     +27.4       1     + 27.4
50-54     52     +22.4       1     + 22.4
45-49     47     +17.4       3     + 52.2
40-44     42     +12.4       4     + 49.6
35-39     37     + 7.4       6     + 44.4
30-34     32     + 2.4       7     + 16.8
25-29     27     - 2.6      12     - 31.2
20-24     22     - 7.6       6     - 45.6
15-19     17     -12.6       8     -100.8
10-14     12     -17.6       2     - 35.2
Sums                    N = 50     Σ|fx| = 425.6


of the scores to lie between 21.1 and 38.1. This would mean 29 of the 50
scores. Since the data are grouped in Table 5.3, we cannot check this
conclusion by actual count of the cases, but a rough check can nevertheless be
made. If we assume that the six individuals in the interval 35-39 are evenly
distributed, about four of them should be below 38.1. If we assume,
likewise, that the six individuals in the interval 20-24 are evenly distributed, then
four of them should be above the point 21.1. With these assumptions made,
there are 27 cases between the points 21.1 and 38.1. This number is 54 per
cent of the sample. Fifty-eight per cent would have called for 29. The
agreement may be regarded as close enough, in view of the fact that the
sample is not very large and the fact that it tends to be positively skewed. Such
a check is often sufficient to tell us whether we have made any serious errors
in computing the average deviation by this method.

1 One check on the accuracy of computations of the fx values is to sum them
algebraically. The sum Σfx should equal approximately zero, small discrepancies due to
rounding errors being tolerated. In Table 5.3, Σfx equals exactly zero.


THE STANDARD DEVIATION 


The standard deviation, or σ, is the most commonly used indicator of degree
of variability, and of the ones described in this chapter it is usually the most 
reliable. That is, it varies least from sample to sample drawn at random 
from the same population. It is therefore more dependable and, as an esti- 
mate of the dispersion of the population, it is more accurate. 

General Formula for the Standard Deviation. Like the AD, the standard 
deviation is also a kind of average of all the deviations about the mean in a 
sample, though it is not a simple arithmetic mean.¹ The fundamental
formula for it is


$\sigma = \sqrt{\dfrac{\Sigma x^2}{N}}$    (Basic formula for the standard deviation in a sample) (5.5)


where x = deviation from the mean of the sample and N = size of the sample.
Formula (5.5) deserves close study. It calls for several steps in fixed order:


Step 1. Find each deviation from the mean (x).

Step 2. Square each deviation, finding x².

Step 3. Sum the squared deviations, finding Σx².

Step 4. Divide this sum by N, finding Σx²/N.

Step 5. Extract the square root of the result of step 4. This is the standard
deviation.²


Variability, Variance, and Sum of Squares. Before proceeding to apply 
the formula, let us consider some important concepts. In verbal terms, a 
standard deviation is the square root of the arithmetic mean of the squared 
deviations of measurements from their mean. It has often been called the 
root-mean-square deviation. But in this simplified statement lies considerable 
meaning. Latent in the few steps enumerated above lie two statistical con- 
cepts that have increasing importance. One is the sum of squares, the end 
result of step 3. The other is called variance, the end result of step 4. These 
ideas are best introduced by means of an illustration. 


1 In some textbooks the standard deviation of a sample is symbolized by the double
lettering SD, or S.D. In some others it is denoted by the letter s. The symbol s, however,
stands for an estimate of the standard deviation of the whole population from which this
particular sample came, and it would be computed by using N - 1 in place of N in formula
(5.5). When N is large (30 or greater), σ and s are practically identical. See Chap. 9 for
further information on the sample and population standard deviations.

2 These steps are illustrated in Tables 5.4 and 5.5 and in Fig. 5.3.

TABLE 5.4. DATA ILLUSTRATING SUM OF SQUARES, VARIANCE, AND STANDARD DEVIATION


  (1)       (2)        (3)            (4)
Person     Score    Deviation    Squared deviation
             X          x               x²
   A        15         +5              25
   B        14         +4              16
   C        11         +1               1
   D        10          0               0
   E         9         -1               1
   F         7         -3               9
   G         4         -6              36
Sums                    0 = Σx         88 = Σx²
Means      10.0         0.0            12.57 = V
Standard deviation                      3.55 = σ


In Table 5.4 are listed seven fictitious scores representing a sample of seven 
individuals A to G inclusive. These are denoted by the usual symbol, X. 
The mean of these seven scores, as shown in column 2, is exactly 10.0. 
Column 3 shows the deviations of these scores from the mean. Their sum is 
zero and also their mean, as is to be expected. In column 4 we find the 
squared deviations. Their sum, 88, is the sum of squares. Their mean is 
equal to 12.57, which we have defined as the variance, in this sample. The 
square root of this is 3.55, the standard deviation. All this follows from 
formula (5.5) and from the steps and definitions given above. Let us see
what this means in terms of a geometrical view of the problem. 
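
The five steps of formula (5.5) can be followed line by line on the seven scores of Table 5.4. The sketch below is only an illustration; the variable names are assumptions, and `statistics.pstdev` (the library routine that divides by N) is used as an independent check.

```python
from math import isclose, sqrt
from statistics import pstdev

# Scores of persons A to G in Table 5.4.
scores = [15, 14, 11, 10, 9, 7, 4]
n = len(scores)
mean = sum(scores) / n

deviations = [x - mean for x in scores]      # step 1: each x = X - M
squares = [d ** 2 for d in deviations]       # step 2: each x squared
sum_of_squares = sum(squares)                # step 3: sum of squares
variance = sum_of_squares / n                # step 4: variance V
sigma = sqrt(variance)                       # step 5: standard deviation

print(mean, sum_of_squares, round(variance, 2), round(sigma, 2))
# 10.0 88.0 12.57 3.55

# The library routine that divides by N gives the same value.
assert isclose(sigma, pstdev(scores))
```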

A Geometric Picture of Deviations, Variance, and Standard Deviation. For 
a geometrical representation of these ideas, see Fig. 5.3. In the first diagram, 
the scale of measurement is shown, as usual, in the form of a straight line 
extending from left to right. Here, however, the original score values are not 
marked. The mean has become recognized as the main reference point and 
has been called zero. This is what happens when we derive deviations x from 
original scores X. All seven individuals still retain their relative positions, in 
correct rank order and at the same separations, as they had before. We have 
merely moved the zero point 10 units up the linear scale.

So much for representing deviations. It will be seen that the points on the 
line correspond exactly with the values in column 3 of Table 5.4. Consider 
now the squaring of the deviations. Where deviations themselves are repre- 
sented by linear distances from a common reference point, squared deviations 
must be represented by areas, namely, squares. The squares belonging to 
the different individuals A to G are shown in Fig. 5.3. The areas of the 
squares are equal numerically to the values given in column 4 of Table 5.4. 
It can be seen that the individuals come in the same rank order when we
compare the squared deviations as when we compare x distances. It is also
notable how large deviations, when squared, increase much more relatively 
than do small deviations. This point will be important to consider later. 
The sum of the squares would be represented geometrically as an area 
equal to a composite of all the squares in Fig. 5.3 I. This could also be shown 


Fig. 5.3. Illustration of deviations from the arithmetic mean, their squares, the mean of the
squares (which is the variance), and the standard deviation (which measures the variability)
in a sample of seven cases. [Figure not reproduced; diagram I shows the squared deviations
on a base line of deviations from the mean running from -6 to +6, and diagram II shows
the variance as a square whose side is the standard deviation.]


as a square or as a rectangle. Its dimensions could vary somewhat but its 
surface would contain 88 units such as those representing persons C and E. 
Finding the arithmetic mean of this large area is equivalent to apportioning 
it equally among the seven individuals. It is the amount of area that each 
person would possess if each one of them were given the same amount. This 
is the variance, which we may represent in the form of a square in Fig. 5.3 II.
This square is shown on a base line like that in the first diagram. Its length
of side is the square root of its area and represents the standard deviation.

Algebraic Interrelationships of Σx², V, and σ. Some important algebraic
relationships, latent in formula (5.5), may be called to the attention of the 
reader. They are all important for general orientation in this topic. They 
may be useful not only in thinking about the concepts of sums of squares, 
variances, and standard deviations but will be found to enter into computa- 
tions of various kinds later. First, two more symbols need to be introduced. 
V is used to stand for variance. With this additional symbol given, we can 
state the following interrelationships: 


$\sigma = \sqrt{\dfrac{\Sigma x^2}{N}} = \sqrt{V}$    (5.6)

$V = \dfrac{\Sigma x^2}{N} = \sigma^2$    (Interrelationships of Σx², V, and σ) (5.7)

$\Sigma x^2 = NV = N\sigma^2$    (5.8)


Both V and σ, each in its own way, are indicators of amount of dispersion
in a distribution. V is said to measure variance, σ to measure variability.
When the sample is one of individuals measured on a common scale, either
V or σ can become familiar indicators of the extent of the individual
differences. To make these concepts more meaningful, then, it is well to think of
them in terms of measures of individual differences. 

Further Interpretations of Variance. Suppose, first, that we have a sample 
of only one case, with only one score. There is no possible basis for individual 
differences in such a sample, and therefore there is no variance or variability. 
Bring into the picture a second individual with his score in the same test or 
experiment. We now have one difference. Bring in a third case and we then
have two additional differences, three altogether. Bring in a fourth, a fifth,
and so on. There are as many differences as there are possible pairs of
individuals. We could compute all these interpair differences and could average
them to get a single, representative value. We could also square them and
then average them. It is far more economical, however, to find a mean of all
the scores and to use that value as a common reference point. Each
difference then becomes a deviation from that reference point, and there are only
as many deviations as there are individuals. Either the variance or the
standard deviation is a single representative value for all the individual
differences when taken from a common reference point.
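
The economy of working from a common reference point can be shown numerically. The sketch below uses the seven scores of Table 5.4 and a standard algebraic identity that is not stated in this text: the sum of the squared differences over all pairs equals N times the sum of squared deviations from the mean. Variable names are illustrative.

```python
from itertools import combinations

scores = [15, 14, 11, 10, 9, 7, 4]     # persons A to G of Table 5.4
n = len(scores)
mean = sum(scores) / n

# Every possible pair of individuals: 21 pairs for 7 persons.
pairs = list(combinations(scores, 2))
pair_square_sum = sum((a - b) ** 2 for a, b in pairs)

# Only 7 deviations are needed when the mean is the reference point.
sum_of_squares = sum((x - mean) ** 2 for x in scores)

print(len(pairs), pair_square_sum, n * sum_of_squares)   # 21 616 616.0
```

Seven deviations thus carry the same information about spread as twenty-one interpair differences, and the saving grows rapidly as N increases.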

Consider the matter from a somewhat different point of view. Consider 
giving a certain test of n items to a group of persons. Before giving the first
item to the group, so far as this test is concerned the individuals are all alike. 
All have scores of zero. There is no variance. This may seem absurd, but 
it has a very reasonable bearing on what comes next. Next administer the 
first item in the test to all individuals in the group. Some will pass it and 
some will fail. Some will now have scores of 1 and some still have scores of
zero. There are two groups of individuals. There is this much
differentiation, this much variance. Give a second item. Of those who passed the
first, some will pass the second and some will fail it, unless the two items are
perfectly correlated. Of those who failed the first, some may pass the second
and some may fail it. There are now three possible scores, 0, 1, and 2. More
variance has been introduced. Carry the illustration further, adding item
by item. The differences among scores will keep increasing, and so, by
computation, also the variance and the variability, as indicated by V and
by σ.

Psychological and educational testing depends almost entirely upon the 
phenomenon of individual differences and therefore upon variance. Probably 
less than 1 per cent of the tests commonly used yield scores on an absolute 
scale. The significance of any score is ordinarily its usefulness in placement 
of a person somewhere in the group. The greater the variance among the
scores, other things being equal, the more accurately each person is placed. 

In addition to the use of the variance and standard deviation in describing
the spread or scatter of a certain sample, there is use, as we shall see in later
chapters, in the evaluation of tests and test items in a number of ways (see
Chap. 17). After this digression, let us return to the descriptive use of σ and
its computation in a typical laboratory problem.

Computation and Interpretation of a Standard Deviation. As an
illustrative problem in computing σ by formula (5.5), let us take the 10
measurements of the threshold for pitch (see Table 5.5). Their mean we found to be

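TABLE 5.5. CALCULATION OF THE STANDARD DEVIATION IN UNGROUPED DATA


 (1)         (2)          (3)
  X           x            x²
Scores    Deviations    Squared deviations
  13        -0.2            .04
  17        +3.8          14.44
  15        +1.8           3.24
  11        -2.2           4.84
  13        -0.2            .04
  11        -2.2           4.84
  17        +3.8          14.44
  13        -0.2            .04
  11        -2.2           4.84
  11        -2.2           4.84
                   Σx² = 51.60

$\sigma = \sqrt{\dfrac{51.60}{10}} = \sqrt{5.160} = 2.27$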


13.2. The deviations from the mean are given in column 2 and their squares
in column 3. Their sum is 51.60. The mean of the squared deviations is
5.160. The standard deviation is the square root of this, or 2.27. This 
should not be reported to more than one decimal place. In terms of the unit 
of our measuring scale, this is 2.3 cycles per second. 

The Interpretation of a Standard Deviation. Now that we have the answer 
2.3 cycles per second, how shall we interpret it? The usual and most 
accepted interpretation is in terms of the percentage of cases included within 
the range from one standard deviation below the mean to one standard 
deviation above the mean. This range on the scale of measurement includes 
about two-thirds of the cases in the distribution. In a normal distribution, it
is known that from -1σ (one standard deviation below the mean) to +1σ
(one standard deviation above), nearly 68.27 per cent of the cases are found.
Since most samples yield distributions that depart to some degree from
normality, we say, “about two-thirds,” which is, of course, a little short of 68.27

Fig. 5.4. Approximate fractions of the area under a normal distribution curve (also
fractions of the N cases in a normally distributed sample) that lie within one standard deviation
of the mean and also beyond the limits of one standard deviation, in either direction.
[Figure not reproduced; the base line is marked at -1σ and +1σ.]


per cent. Figure 5.4 illustrates the division of the area under a normal curve
into regions marked off at -1σ and +1σ. With two-thirds of the surface
within those limits, there is left one-third of the area to be divided between
the two “tails” of the distribution, one-sixth below the point at -1σ and
one-sixth above the point at +1σ.

In the problem just solved, where we found σ equal to 2.3, the distance from
-1σ to +1σ on the scale of measurement is 10.9 to 15.5 cycles; i.e., the mean
13.2 minus 2.3 is 10.9, and the mean plus 2.3 is 15.5 cycles. Within these
limits are all measurements of 11, 12, 13, 14, and 15. By actual count, there
are four 11’s, three 13’s, and one 15, or 8 of the 10 measurements within these
limits, whereas we should have expected 7. But, because of the small number
of cases and the fact that the distribution is irregular, we should not be
surprised at this result. In other problems this comparison serves as a rough
check upon the accuracy of computation of σ. It will not catch all errors but
will indicate gross errors if the sample is not too small and the distribution is
fairly normal.
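
The computation of σ for the ten threshold measurements and the count of cases within one standard deviation of the mean can be sketched together. The variable names are assumptions of this example; `NormalDist` from the standard library supplies the normal-curve proportion for comparison.

```python
from math import sqrt
from statistics import NormalDist

# The ten threshold measurements (cycles per second).
scores = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]
n = len(scores)
mean = sum(scores) / n

sum_sq = sum((x - mean) ** 2 for x in scores)   # sum of squared deviations
sigma = sqrt(sum_sq / n)                        # formula (5.5)

# Count the cases lying from one sigma below to one sigma above the mean.
low, high = mean - sigma, mean + sigma
inside = sum(low <= x <= high for x in scores)

# Proportion expected between -1 sigma and +1 sigma in a normal curve.
expected_share = NormalDist().cdf(1) - NormalDist().cdf(-1)

print(round(sum_sq, 2), round(sigma, 2), inside,
      round(100 * expected_share, 2))
```

Eight of the ten cases fall within the limits, against roughly seven expected, which is as close as a sample of this size permits.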

Grouping Deviations as a Short Cut. Some saving in time and effort can be
afforded in the solution of the standard deviation in data like those in Table
5.5, if we group them as in Table 5.6. Since the same measurement is
repeated several times and its deviation from the mean is the same every
time, and also its deviation squared, we need to find the deviation and its
square only once and multiply each x² by its frequency. The last column of
Table 5.6 contains the fx² products, and it will be seen that their sum is again
51.60, from which the standard deviation will be the same as before. The
formula for this reads


$\sigma = \sqrt{\dfrac{\Sigma f x^2}{N}}$    (Standard deviation from grouped data) (5.9)


where the symbols are defined as before. 
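
The short cut of formula (5.9) amounts to tallying each distinct measurement once. A minimal sketch, assuming the same ten threshold measurements and using `collections.Counter` for the tally (the variable names are illustrative):

```python
from collections import Counter
from math import sqrt

scores = [13, 17, 15, 11, 13, 11, 17, 13, 11, 11]

# Frequency of each distinct measurement: 11 x 4, 13 x 3, 15 x 1, 17 x 2.
freq = Counter(scores)
n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n

# Each squared deviation is found once and multiplied by its frequency.
sum_fx2 = sum(f * (x - mean) ** 2 for x, f in freq.items())
sigma = sqrt(sum_fx2 / n)                # formula (5.9)

print(dict(freq), round(sum_fx2, 2), round(sigma, 2))
```

Only four squared deviations are computed instead of ten, yet the sum of squares and σ are unchanged.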


TABLE 5.6. CALCULATION OF THE STANDARD DEVIATION IN GROUPED DATA WITH
THE USE OF ACTUAL DEVIATIONS


A similar treatment may be given all grouped data, in which we let the 
midpoint of each interval represent all cases within the interval, and this 
value (Xc) minus M gives the deviation of all cases within the interval. From
here on, the procedure is the same as that in Table 5.6. We shall not illus- 
trate the steps by means of a special problem, for there are more efficient 
ways of dealing with grouped data, ways that will now be described. 

The Standard Deviation by the Code Method. The code method, which 
was employed in the preceding chapter to calculate a mean (Table 4.2), will 
now be extended in order to compute a standard deviation. The first steps 
are identical with those employed to compute a mean. The whole process of 
computing a standard deviation by the code method can be carried through 
to the final step in terms of the coded values. That is, we can use the x'
deviations from the temporary origin (see p. 56). The main formula is¹


$\sigma = i \sqrt{\dfrac{\Sigma f x'^2}{N} - M_{x'}^2}$    (Standard deviation from grouped and coded values) (5.10)


1 Proof bearing upon the effect of coding upon the standard deviation will be found in
Appendix A.

where i = size of class interval
x' = deviation from the origin of coded values
Mx' = mean of the coded values

For convenience in computation, the formula may be modified to read


$\sigma = \dfrac{i}{N} \sqrt{N \Sigma f x'^2 - (\Sigma f x')^2}$    [Alternate for (5.10)] (5.11)


TABLE 5.7. CALCULATION OF THE STANDARD DEVIATION USING THE CODE METHOD


 (1)      (2)     (3)     (4)      (5)
Score      f      x'      fx'     fx'²
55-59      1      +5     + 5       25
50-54      1      +4     + 4       16
45-49      3      +3     + 9       27
40-44      4      +2     + 8       16
35-39      6      +1     + 6        6
30-34      7       0       0        0
25-29     12      -1     -12       12
20-24      6      -2     -12       24
15-19      8      -3     -24       72
10-14      2      -4     - 8       32
Sums   N = 50        Σfx' = -24    Σfx'² = 230

$M_{x'} = \dfrac{\Sigma f x'}{N} = \dfrac{-24}{50} = -.48$

$\sigma = 5 \sqrt{\dfrac{230}{50} - (-.48)^2} = 5 \sqrt{4.6 - .2304} = 5 \sqrt{4.3696} = 5 \times 2.09 = 10.45$


The code method is illustrated in Table 5.7, which is similar to Table 4.2
through column 4. For all class intervals, we need to know the fx'² products,
and these are given in column 5. In each row, the fx'² product is found by
multiplying the corresponding numbers in columns 3 and 4; i.e., the first one,
25, is the product of 5 × 5; the second one is the product of 4 × 4; and the
third, the product of 3 × 9; etc. This is because the product fx'² may be
factored as (fx')x'. It is excellent checking procedure to do the multiplying
also by the product (f) × (x'²) for each interval.

Next we sum the fx'² products to obtain Σfx'². In Table 5.7, this is 230.
To find Mx', we divide Σfx' by N. In this case, it is -24/50, which equals
-0.48. We need (Mx')², which is 0.2304. Now, to apply formula (5.10), we
need next to divide Σfx'² by N, or 230/50, which equals 4.6. Deduct (Mx')²
from this, or 4.6 - 0.2304, and we have 4.3696. The square root of this is
called for next, and this is 2.09. The last step is to multiply by i, the size of
the class interval; 2.09 × 5 equals 10.45, which is the standard deviation we
have been seeking.
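
The code-method computation can be retraced as follows. This is a sketch under the assumption that formula (5.10) has the form implied by the worked steps above (divide Σfx'² by N, subtract the squared coded mean, take the square root, multiply by i); the second expression, with N brought outside the root, is algebraically equivalent and is shown only as a cross-check. Variable names are illustrative.

```python
from math import isclose, sqrt

i = 5                                         # size of class interval
freqs = [1, 1, 3, 4, 6, 7, 12, 6, 8, 2]       # Table 5.7, highest first
coded = [5, 4, 3, 2, 1, 0, -1, -2, -3, -4]    # x' deviations, in intervals

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, coded))        # sum of fx'
sum_fx2 = sum(f * x * x for f, x in zip(freqs, coded))   # sum of fx' squared
m_coded = sum_fx / n                                     # mean of coded values

# Mean of squares minus squared mean, then back to score units.
sigma_a = i * sqrt(sum_fx2 / n - m_coded ** 2)

# Equivalent form that avoids rounding the coded mean.
sigma_b = (i / n) * sqrt(n * sum_fx2 - sum_fx ** 2)

print(sum_fx, sum_fx2, m_coded, round(sigma_a, 2))   # -24 230 -0.48 10.45
assert isclose(sigma_a, sigma_b)
```

The second form works with whole numbers until the final division, which is convenient when the coded mean does not come out to a short decimal.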




We may now say that about two-thirds of the individuals should be 
expected between the mean minus 10.45 and the mean plus 10.45. Since the 
mean is 29.6, these limits are 19.2 and 40.0. Fortunately, for the sake of 
checking on this conclusion, these limits are close to the division points 
between class intervals (see Table 5.7). The four intervals included within 
these limits have in them 31 cases altogether, which are 62 per cent of the 
whole group. This is a little short of two-thirds but not unreasonably so. 
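
The two-thirds check just described can be sketched in Python. The snippet below is an illustration under one stated assumption: an interval is counted only when its exact limits (half a score unit beyond its lowest and highest scores) lie wholly between the mean minus and the mean plus one standard deviation.

```python
# Class intervals of Table 5.7 as (lowest score, highest score, frequency).
intervals = [
    (10, 14, 2), (15, 19, 8), (20, 24, 6), (25, 29, 12), (30, 34, 7),
    (35, 39, 6), (40, 44, 4), (45, 49, 3), (50, 54, 1), (55, 59, 1),
]
mean, sigma = 29.6, 10.45
low, high = mean - sigma, mean + sigma   # the limits quoted in the text

n = sum(f for _, _, f in intervals)

# Assumption: count an interval only if its exact limits fall inside (low, high).
inside = sum(f for lo, hi, f in intervals
             if lo - 0.5 >= low and hi + 0.5 <= high)
share = 100 * inside / n
print(inside, n, share)
```

With these data the count covers the four intervals from 20-24 through 35-39, the same ones used in the text.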

Rough Checks for a Computed Standard Deviation. The kind of comparison
just mentioned is a rough check for the correct solution of the standard
deviation. If the actual percentage of cases between +1σ and −1σ deviates too
far from 68 per cent, there is probably something wrong with the calculation,
and a recalculation is in order. This check cannot often be satisfactorily
applied with grouped data because the frequencies from −1σ to +1σ cannot
then be accurately determined.

Another rough check is to compare the standard deviation obtained with 
the total range of measurements. In large samples (N = 500 or more) the
standard deviation is about one-sixth of the total range. Stated in other 
terms, the total range is about six standard deviations. In smaller samples, 
the ratio of range to standard deviation becomes smaller, as indicated in 
Table 5.8. 


TABLE 5.8. RATIOS OF THE TOTAL RANGE TO THE STANDARD DEVIATION IN A 
DISTRIBUTION FOR DIFFERENT VALUES OF N*


* Adapted from Snedecor, G. W. Statistical Methods. Ames, Iowa: Collegiate, 1940. P. 85. 


In the ink-blot data, since N = 50, we should expect the range to be 4.5
times the standard deviation. The standard deviation 10.45 times 4.5 gives 
us an expected range of about 47 points. Actually the range was 46 points, 
which checks so closely as to give us confidence that our standard deviation 
is at least not grossly in error. 
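
The range check amounts to one multiplication and a comparison. A minimal Python sketch follows; the ratio 4.5 is the value the text quotes from Table 5.8 for N = 50, and the name expected_range is hypothetical.

```python
def expected_range(sigma, ratio):
    """Range one would expect for a given sigma and range-to-sigma ratio."""
    return sigma * ratio


sigma = 10.45
ratio_for_n_50 = 4.5      # quoted in the text for N = 50
observed_range = 46       # range of the ink-blot scores, as stated in the text

predicted = expected_range(sigma, ratio_for_n_50)
relative_gap = abs(predicted - observed_range) / observed_range
print(round(predicted, 1), round(relative_gap, 3))
```

A gap of a few per cent, as here, is what the text treats as a satisfactory agreement; how large a gap should prompt a recalculation is left to judgment.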

It may seem strange that we use a less reliable statistic like range as a 
criterion of accuracy of a more reliable statistic like the standard deviation. 
The reasons are that (1) there can hardly be any error in computing such a 
simple thing as the range, whereas (2) there are chances of gross errors in 
calculating σ because of the many steps involved, for example, failing to make
the final step of multiplying by i.

A Summary of Steps for Computing the Standard Deviation. The steps
necessary for the calculation of σ by the code method are as follows:

Step 1. Complete steps 1 through 6 already listed for finding the mean by
the code method (see Table 4.2).
Step 2. Find for every class interval the fx'² product. The most efficient
way is to compute the product of x' times fx' for each interval.
These products will all be positive.
Step 3. Sum the fx'² products.
Step 4. Divide this sum by N, carrying to at least two decimal places.
Step 5. Find M_x'², to at least two decimal places.
Step 6. Deduct the number found in step 5 from that found in step 4.
Step 7. Find the square root of the number found in step 6, keeping two
decimal places.
Step 8. Multiply this number by the size of the class interval. If N is large,
report two decimal places; if small, round to one decimal place.
Step 9. Interpret the standard deviation in terms of the two-thirds principle.
Step 10. Apply the rough check of comparing σ with the range and using the
ratios of Table 5.8.


The Standard Deviation from Original Measurements. If the number of 
measurements is not large, if the measurements themselves are small num- 
bers, particularly when a good calculating machine is available, the best pro- 
cedure for computing a standard deviation is by means of the formula 


$\sigma = \frac{1}{N}\sqrt{N\sum X^2 - \left(\sum X\right)^2}$   (5.12)


in which the essential steps are:

Step 1. Square each score or measurement.
Step 2. Sum the squared measurements to give ΣX².
Step 3. Multiply ΣX² by N to give NΣX².
Step 4. Sum the X's to find ΣX.
Step 5. Square ΣX to find (ΣX)².
Step 6. Find the difference NΣX² − (ΣX)².
Step 7. Find the square root of the number found in step 6.
Step 8. Divide the number found in step 7 by N (or multiply it by 1/N).


On the calculating machine, the X's and the X²'s can be accumulated at
the same time according to instructions provided with the machine. In 
tabular form, the solution of this kind is illustrated in Table 5.9. 
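
The eight steps can also be followed in Python. The sketch below uses the ten scores of Table 5.9; the name raw_score_sd is hypothetical, and the comparison with statistics.pstdev (the standard-library population standard deviation, which also divides by N) is added here only as an independent check.

```python
from math import sqrt
from statistics import pstdev

scores = [13, 17, 15, 11, 13, 17, 11, 13, 11, 11]   # the X column of Table 5.9


def raw_score_sd(xs):
    """Standard deviation from original measurements, steps 1 to 8."""
    n = len(xs)
    sum_x2 = sum(x * x for x in xs)         # steps 1-2: sum of squared scores
    sum_x = sum(xs)                         # step 4: sum of scores
    difference = n * sum_x2 - sum_x ** 2    # steps 3, 5, 6
    return sqrt(difference) / n             # steps 7-8


sigma = raw_score_sd(scores)
check = pstdev(scores)   # divides by N, like the formula in the text
print(round(sigma, 2), round(check, 2))
```

Because the formula divides by N rather than N − 1, it matches pstdev and not statistics.stdev.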

Grouping Original Measurements. If the scores are conveniently grouped 
and their frequencies tabulated, as in Table 5.10, some saving in work can be 
effected. The steps by which we arrive at I and N should now be easy 

In this, and in the following steps, it is assumed that we are dealing with integral
measurements. If they are in terms of decimal fractions or multiples of 10 or 100, this
rule applies only after making the necessary allowance for the place of the decimal point.




TABLE 5.9. CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL
MEASUREMENTS AND UNGROUPED DATA


   X       X²
  13      169
  17      289
  15      225
  11      121
  13      169
  17      289
  11      121
  13      169
  11      121
  11      121

 132    1,794
  ΣX      ΣX²

σ = (1/10) √(10(1,794) − 132²)
  = (1/10) √(17,940 − 17,424)
  = (1/10) √516
  = 22.7/10
  = 2.27, or 2.3


TABLE 5.10. CALCULATION OF THE STANDARD DEVIATION FROM THE ORIGINAL 
MEASUREMENTS, WITH GROUPING 


to follow by an analogy to the last previous solution. Once those values are
obtained, steps 6 to 8 above can be followed to arrive at σ. The formula for
this procedure is

$\sigma = \frac{1}{N}\sqrt{N\sum fX^2 - \left(\sum fX\right)^2}$   (5.13)
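
The grouped version can be sketched in the same way. Since the body of Table 5.10 is not reproduced here, the example below assumes, purely for illustration, that the ten scores of Table 5.9 are tallied by value; the function name grouped_sd is hypothetical, and the formula coded is our reading of (5.13), with fX and fX² in place of X and X².

```python
from collections import Counter
from math import sqrt

scores = [13, 17, 15, 11, 13, 17, 11, 13, 11, 11]
frequency_table = Counter(scores)     # maps each score X to its frequency f


def grouped_sd(freq_table):
    """Standard deviation from scores tabulated with their frequencies."""
    n = sum(freq_table.values())
    sum_fx = sum(f * x for x, f in freq_table.items())
    sum_fx2 = sum(f * x * x for x, f in freq_table.items())
    return sqrt(n * sum_fx2 - sum_fx ** 2) / n


sigma = grouped_sd(frequency_table)
print(sorted(frequency_table.items()), round(sigma, 2))
```

Tallying identical scores changes only the bookkeeping; the sums, and therefore σ, are the same as for the ungrouped solution.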


Correction of the Standard Deviation for Coarse Grouping. We are now 
ready to see more clearly why the number of class intervals should not be too 
small in grouping data or the class interval too large. Reference was previ- 
ously made (p. 50) to a "grouping error." Let us see what the grouping 
error is and how it affects the standard deviation. 

This phenomenon is illustrated in Fig. 5.5. There, a distribution is drawn
with only five intervals. Our computations with grouped data thus far have
assumed that all the values within an interval may be given a class value
corresponding to the midpoint of the interval. In coarse grouping the
midpoint value is not a very exact representative one because the cases are not
distributed evenly, or even symmetrically, within the interval. The only
exception to this is the interval that may happen to straddle the mean, in
which case the midpoint and the average of the cases in the class will coincide.

In other intervals, note that the frequencies are greater toward the limit
on the side nearer the middle of the distribution. If we computed an actual
mean of the cases within each interval, we should find it nearer the mean of
the entire sample than the midpoint is. The difference between the class
mean and the midpoint of an interval is the grouping error in that interval.
Above the sample mean the grouping errors are ordinarily positive (midpoint
greater than the class mean) and below the sample mean the errors are
ordinarily negative (midpoint less than the class mean). The effect of the
grouping errors upon the computation of a mean is usually almost nil because
they are fairly well balanced. But their effect upon the average deviation,
and especially upon the standard deviation, is often large enough to be
concerned about. Grouping errors tend to enlarge the standard deviation, and
the coarser the grouping, the greater is this systematic error in σ.

FIG. 5.5. Illustration of grouping errors resulting from letting the midpoint of each class
interval represent all cases within the interval rather than using the mean of the values in
that interval. The smaller the number of intervals, the greater the error. (Figure labels:
actual means of class values; midpoints of class intervals.)

Sheppard's Correction. When a correction in σ is necessary, Sheppard's
formula, developed for this purpose, serves very well. When applied to a
known standard deviation, it reads

$\sigma_c = \sqrt{\sigma^2 - \frac{i^2}{12}}$   (Sheppard's correction in σ for coarse grouping)   (5.14)

where σ_c = standard deviation corrected for errors of grouping
σ = uncorrected standard deviation computed from data grouped in
class intervals
i = size of the class interval

To apply the correction earlier in the operations, as in connection with
formula (5.10), we have

$\sigma_c = i\sqrt{\frac{\sum fx'^2}{N} - \left(\frac{\sum fx'}{N}\right)^2 - \frac{1}{12}}$   (Solution of σ with Sheppard's correction included)   (5.15)


It has been stated that when the size of class interval, i, is equal to .49σ,
Sheppard's correction amounts to only about 1 per cent. Such an error could
be tolerated unless very precise calculations are going to be done with σ after
it is computed. If an interval is about one-half σ (i.e., .49σ), as just stated,
and if the sample is large, with a range of about six standard deviations, we
should then have 12 class intervals. For large samples, then, 12 class
intervals is a minimum for accurate computation of the standard deviation. If
there are fewer than 12, for accurate work we should apply Sheppard's
correction. Whether or not we apply this correction, therefore, depends upon the
size of sample, the number of intervals, and the use we intend to make of σ.
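
A short Python sketch shows the size of the correction in practice. It assumes the usual form of Sheppard's correction, in which one-twelfth of the squared interval size is deducted from the variance before the square root is taken; the function name sheppard_corrected_sd is hypothetical.

```python
from math import sqrt


def sheppard_corrected_sd(sigma, interval_size):
    """Deduct i squared over 12 from the variance, then take the root."""
    return sqrt(sigma ** 2 - interval_size ** 2 / 12)


# Ink-blot data: sigma = 10.45 from grouped scores with class interval i = 5.
corrected = sheppard_corrected_sd(10.45, 5)

# Relative shrinkage when the interval equals .49 of sigma (sigma taken as 1).
shrinkage = 1 - sheppard_corrected_sd(1.0, 0.49)
print(round(corrected, 2), round(100 * shrinkage, 1))
```

The second figure is the percentage by which σ is reduced when i = .49σ, which can be set beside the "about 1 per cent" stated in the text.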


DESCRIPTIVE USE OF STATISTICS


Thus far, the chief uses proposed for measures of central value and of dis- 
persion have been as simple values descriptive of total distributions. This is 
best appreciated when we compare different samples. As an illustration of 
this, see Table 5.11, in which we have a few samples of Army General Classifi- 
cation Test data, each based upon a different civilian occupational group. 
We shall not concern ourselves at the moment with the question of how 
adequate these particular samples are either for size or for representativeness 
of the populations from which they are purported to come. These consider- 
ations are, of course, important if we want to generalize our conclusions to 
those populations. We can still compare samples as such. 

Some general conclusions can be drawn from the inspection of Table 5.11. 


TABLE 5.11. STATISTICS DESCRIBING DISTRIBUTIONS OF SCORES FOR SELECTED
OCCUPATIONAL GROUPS WHO TOOK THE ARMY GENERAL CLASSIFICATION
TEST DURING WORLD WAR II*


Occupation         N       M      Mdn      σ      Range
Accountant        172    128.1   128.1    11.7    94-157
Lawyer             94    127.6   126.8    10.9    96-157
Reporter           45    124.5   125.7    11.7   100-157
Sales clerk       492    109.2   110.4    16.3    42-149
Plumber           128    102.7   104.8    16.0    56-139
Truck driver      817     96.2    97.8    19.7    16-149
Farm hand         817     91.4    94.0    20.7    24-141
Teamster            7     87.7    89.0    19.6    45-145

* From Harrell, T. W., and Harrell, M. S. Army General Classification Test scores for civilian
occupations. Educ. psychol. measmt., 1945, 5, 229-240. By permission of the publisher.

When the means and medians are placed in rank order, it will be seen that the
occupational groups fall into an approximate rank order for socioeconomic 
level. It is also apparent, as should have been expected, that occupations 
requiring more “headwork” are highest in the list. The test emphasized 
verbal, reasoning, and numerical facilities. 

The importance of having both means and medians lies in the information 
they give concerning skewness. For the lower occupational groups, particu- 
larly, the medians are slightly higher than the means. This indicates slight 
negative skewing. This is a somewhat surprising result, for one would expect 
that the higher the mean, the greater the negative skewing, and the lower the 
mean, the greater the positive skewing. When a test of moderate difficulty 
is administered to a group of low average ability, scores tend to bunch at the
lower end of the scale (positive skewing). When the same test is given to a
group of high average ability, the bunching is expected near the upper end of 
the scale (negative skewing). Since in the data of Table 5.11 the skewing 
seems to be negative for most occupational groups and most marked for those 
of low average ability, some explanation is demanded. We can only specu- 
late, which means we can suggest several hypotheses which would need 
further investigation in order to evaluate their worth. One hypothesis 
might be that in any occupational group, particularly among those of lower 
ability in the test, a minority of the examinees were very poorly motivated or 
took the test under adverse conditions so that they did not perform up to 
their characteristic level. 

Two indices of dispersion are given: the standard deviation and the total 
range. Each tells its own story. Standard deviations are more meaningful 
here if it is remembered that for the total range of scores, all occupational 
groups combined, the standard deviation was approximately 20.0. The 
scaling that was utilized aimed at a standard deviation of 20.0 and a mean of 
100. The mean in some forms of the test turned out to be somewhat above 
100. We should expect dispersions within selected occupational groups to be
smaller than the dispersions for all occupations combined. With three 
exceptions in Table 5.11, this is true. On the whole, the higher the occupa- 
tional group and the higher the mean, the smaller the dispersion. The higher 
groups should not be expected to scatter so far from the mean, because the 
mean score approaches the highest scores made by individuals in any group. 
We might expect a similar curtailment for groups with lowest means. But a 
study of the ranges will show why this did not occur. 

The ranges, as such, are surprisingly large for all groups. It is hard to 
imagine any individuals in the professional groups with scores below the 
general average, unless those scores were low because of poor motivation 
or because of advancing age, which is associated with slower rate of work. 
The test was a speed test. The lowest scores for the lower occupational 
groups are in line with expectations, but the maximum scores in those same
groups are illuminating. Many a clerk or truck driver could evidently have
successfully undertaken training for one of the professional occupations. In 
their prewar assignments they for some reason did not take full vocational 
advantage of their abilities. It is this fact and also the fact that men of very 
low academic abilities can engage successfully in the occupations like farm 
hand and teamster that are largely responsible for the unusually wide
dispersions of scores in such occupational groups.

In this discussion we are not particularly interested in settling points con- 
cerning the relation of mental abilities to occupational level or success. The 
data were presented here merely as an illustration of the kind of inferences 
one may draw from a set of statistics and the hypotheses that may be set up 
for further investigation, possibly of a very fruitful nature. Such inferences 
and hypotheses would be impossible to make without this kind of inspection, 
and the inspection is made possible by having the statistical information. 


USES AND INTERRELATIONSHIPS OF DIFFERENT MEASURES
OF DISPERSION


Choice of the Statistic to Use. Several considerations come into the pic- 
ture when we decide what measure of variability to employ in any situation. 
One is the reliability of the statistic, its relative constancy in repeated sam- 
ples. In this respect, the statistics come in the order, from most reliable to 
least reliable: standard deviation, average deviation, semi-interquartile range, 
and total range. So far as quickness and ease of computation are concerned, 
the four are almost in reverse order to that just given. If further statistical 
computation is to be given the data, such as estimating reliability of the mean 
and of differences between means, computing coefficients of correlation, 
regression equations, and the like, then the standard deviation is by all odds 
the one to employ. 

As between standard deviation and average deviation, there is sometimes a
choice. The standard deviation, because it derives from squared deviations,
gives relatively more weight to extreme deviations from the mean. If a 
distribution should have an unusual number of extreme cases in one or both 
directions from the mean, some investigators prefer the average deviation to 
the standard deviation. This rule includes cases of markedly skewed 
distributions. 

The semi-interquartile range gives even less importance to extreme devia- 
tions than does the average deviation and would sometimes be given prefer- 
ence to both standard and average deviations for this reason. It gives more 
importance to the central mass of cases. When the median is the measure of 
central value adopted, Q should naturally be the companion measure of 
variability. Both are based upon the same principles. When distributions
are truncated, or have some indeterminate values, only Q can justifiably be
used to indicate variability.

To recapitulate,


1. Use the range when
a. The quickest possible index of dispersion is wanted.
b. Information is wanted concerning extreme scores.
2. Use the semi-interquartile range, Q, when
a. The median is the only statistic of central value reported.
b. The distribution is truncated or incomplete at either end.
c. There are a few very extreme scores or there is an extreme skewing.
d. We want to know the actual score limits of the middle 50 per cent of
the cases.
3. Use the average deviation when
a. There are extreme deviations, which, when squared, would bias
estimation of the standard deviation.
b. A fairly reliable index of dispersion is wanted without the extra labor
of computing a standard deviation.
c. The distribution is nearly normal and we can therefore estimate σ
from the AD [see formula (5.18)].
4. Use the standard deviation when
a. Greatest dependability of the value is wanted.
b. Further computations that depend upon it are likely to be needed.
c. Interpretations related to the normal distribution curve are desired.
It will be found in a later chapter that the standard deviation has a
number of useful relationships to the normal curve and to other
statistical ideas.


Relationships among the Measures of Dispersion. Previously, the stand- 
ard deviation was related roughly to the range of measurements in a sample. 
In the general run of samples one meets in statistical work, the range varies 
from four to six times the standard deviation (see Table 5.8), depending upon 
the size of sample. If the distribution with which we deal is normal, or 
nearly normal, in form, we can use a number of other relationships. In a 
strictly normal distribution the following relationships hold: 


Q  = .845AD = .6745σ     (5.16)
AD = 1.183Q = .798σ      (5.17)
σ  = 1.483Q = 1.253AD    (5.18)
(Conversion of one measure of dispersion into another, assuming a normal distribution)
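
The constants in these equations can be recovered from the normal curve itself, which is a convenient check on a table of conversion factors. The Python sketch below rests on two standard facts about the normal distribution, stated here as assumptions of the example: Q is the distance from the mean to the 75th centile, and the average deviation equals σ times the square root of 2/π. The constant names are hypothetical.

```python
from math import pi, sqrt
from statistics import NormalDist

# Ratios to sigma in a unit normal distribution.
Q_PER_SIGMA = NormalDist().inv_cdf(0.75)   # semi-interquartile range
AD_PER_SIGMA = sqrt(2 / pi)                # average (mean absolute) deviation

conversions = {
    "Q from AD": Q_PER_SIGMA / AD_PER_SIGMA,
    "Q from sigma": Q_PER_SIGMA,
    "AD from Q": AD_PER_SIGMA / Q_PER_SIGMA,
    "AD from sigma": AD_PER_SIGMA,
    "sigma from Q": 1 / Q_PER_SIGMA,
    "sigma from AD": 1 / AD_PER_SIGMA,
}
for name, factor in conversions.items():
    print(f"{name}: {factor:.4f}")
```

The printed factors can be compared with those in (5.16) to (5.18); they apply only insofar as the distribution in hand is close to normal.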


These equations are most useful for checking purposes when for some 
reason we have computed two or more of the statistics. They are also useful 
in estimating one measure of dispersion from another when we do not take
the trouble to compute more than one. This should be done only with great
caution, however, being assured both that the distribution is close to normal
and that the one computed statistic is correct.


THE COEFFICIENT OF VARIATION


Absolute versus Relative Variability. Measures of variability are not 
directly comparable unless they are based upon the same scale of measure- 
ment with the same unit. It is even questionable whether one should compare
absolute variabilities on the same measuring scale when two groups have
decidedly different means. For example, the variability in height of infants 
might naturally be expected to be less than the variability in height of adults. 
If we are interested in comparing the variability in height of infants, as 
infants, with variability in height of adults, as adults, we need to consider 
infant and adult norms. These norms are naturally given in terms of means
or medians. We are here concerned with relative variability rather than 
absolute variability. The question is more correctly stated by saying, “Is 
the variability of infants’ heights in ratio to their mean as great as the vari- 
ability of adults’ heights in ratio to their mean?" We therefore need to know 
the ratio of the standard deviation to the corresponding mean. It is custom- 
ary to multiply this ratio by 100, which tells us what percentage of the mean 
the standard deviation is. The formula is 


$CV = \frac{100\sigma}{M}$   (Coefficient of variation)   (5.19)


Relative Variability and Weber’s Law. One important application of the 
coefficient of variation is in the field of psychophysics. If we ask an observer 
to duplicate a 90-mm. line by freehand drawing 50 times and if we then com- 
pute the mean and standard deviation of his reproductions, we may expect a 
mean something like 107 mm. and a standard deviation of about 5 mm. His 
coefficient of variation is 4.7; or, in other words, his variability is 4.7 per cent 
of his mean. In duplicating a line of 180 mm. 50 times, let us say that his
mean is 195 mm. and his standard deviation is 8 mm. The variability has
increased as well as his average. According to Weber's law, it should have
kept in step with his increase in average, and the coefficient of variation 
should consequently be the same. CV is now 4.1 per cent, or almost the 
same as before, but is perhaps lower than Weber's law requires. Results in 
the past have typically shown that, with increasing mean, the absolute vari- 
ability does increase though not so rapidly in proportion, so that the relative 
variability decreases and does not remain constant, as according to Weber's 
law. We are not concerned here particularly with the validity of Weber's 
law except as it illustrates the importance of relative variability. 

When Not to Apply the Coefficient of Variation. One important word of 
caution is necessary concerning the application of CV. It should not be 
applied unless we are rather certain that our measuring scale is one of equal units
and, above all, unless the absolute zero point is taken into account. These
qualifications almost entirely confine us to measuring scales with physical
units, such as linear distances, weights, and time. They rule out ordinary
test and examination scores, even mental-age and IQ units, and thus materially
reduce the areas of application of CV in psychological investigations.

To illustrate the seriousness of this, let us note a fictitious but not unreason- 
able example. In a certain psychological test composed of items the mean 
is 8.5 and the standard deviation is 3.4. The coefficient of variation would 
be 340/8.5 = 40.0. The standard deviation is 40 per cent of the mean. But
remember that scores on such tests do not represent distances from a mean- 
ingful or absolute zero point. Let us assume that an obtained score of zero 
on this test actually represents an ability that is 12 units above the genuine 
zero point, 12 units of the same order of magnitude of the units within the 
obtained range of scores. On such an "absolute" scale, the mean of the 
scores would be 20.5 rather than 8.5. The standard deviation would remain 
the same, 3.4, since we have in effect merely added 12 points to each person's 
score and have not disturbed the scores’ relative positions. The CV now 
becomes 340/20.5 = 16.6, or less than half what it was before, while the
absolute variability has remained the same. 
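
The coefficients quoted in this section can be recomputed with a one-line function. The Python sketch below uses only the means and standard deviations given in the text; the function name coefficient_of_variation is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """CV: the standard deviation as a percentage of the mean."""
    return 100 * sd / mean


# Line-drawing example: reproductions of the 90-mm. and the 180-mm. line.
cv_short_line = coefficient_of_variation(5, 107)
cv_long_line = coefficient_of_variation(8, 195)

# Test-score example: same standard deviation, zero point moved by 12 units.
cv_obtained_scale = coefficient_of_variation(3.4, 8.5)
cv_absolute_scale = coefficient_of_variation(3.4, 8.5 + 12)

for cv in (cv_short_line, cv_long_line, cv_obtained_scale, cv_absolute_scale):
    print(round(cv, 1))
```

The last two values show the point of the caution: the standard deviation is unchanged, yet the coefficient depends entirely on where the zero point is placed.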


Exercises 


1. Compute the interquartile and semi-interquartile ranges for the distributions in
Data 4A, 4B, and 4F. Interpret your findings.

2. Compute the standard deviation for any or all of the distributions in Data 4A to
4F inclusive. Use any of the formulas that seem most convenient. Interpret your
findings.

3. Compute the standard deviation in any or all of the distributions in Data 4G. Use
any of the formulas that seem most convenient.

4. Compute the average deviation for any or all of the distributions in Data 4G.

5. Decide which measure of variability is wisest to employ with each of the distributions
in Data 4A to 4F inclusive and which is second best. Give reasons.

6. In which of the same distributions would one be justified in computing a coefficient
of variation and in which ones not? Give reasons.

7. Compute the standard deviation for Data 5A, with and without Sheppard's correction.


DATA 5A. SCORES IN A FINAL EXAMINATION


Scores Frequencies 


70-79 1 
60-69 4 
50-59 10 
40-49 15 
30-39 8 
20-29 2 


8. Compute the coefficient of variation for each distribution in Data 5B. Interpret
the table as it stands, and also your computed coefficients.


DATA 5B. SCORES IN THREE MOTOR TESTS


Tapping rate 
Test 


Men Women 


Mean... 210.4 184.0 5.13
Standard deviation. 20.0 19.3 1.9 
осо ПАРОВА 101 161 5 


Answers 


1. Q: 3.5; 7.7; 1.19. 

2. 4.58; 10.86; 9.78; 9.75; 13.92; 10.42; 12.64; 9.97; 2.12; 2.77; 1.69; 1.30. 
3. 3.2; 4.4; 3.2; 7.6; 7.5. 

4. 2.7; 3.8; 2.3; 6.2; 6.4. 

7. Corrected σ = 10.68; uncorrected σ = 11.07.

8. 9.51; 10.5; 15.2; 20.1; 28.4; 37.0. 


CHAPTER 6 


CUMULATIVE DISTRIBUTIONS AND NORMS 


Many statistical procedures, particularly those applied to test scores, are 
based upon the cumulative frequency distribution. Heretofore we have 
given frequencies as belonging to certain scores or to class intervals. In this 
chapter, we are interested in the number of scores or measurements falling 
below a certain point on the measuring scale. The cumulative frequency 
corresponding to any class interval is the number of cases within that interval 
plus all those in intervals lower on the scale. 


CUMULATIVE FREQUENCIES AND CUMULATIVE DISTRIBUTION CURVES 


How to Find the Cumulative Frequencies. The cumulative frequencies 
are very readily found from the ordinary noncumulative frequencies. Our 
first example is with the already familiar ink-blot-test scores (see Table 6.1). 


TABLE 6.1. CUMULATIVE FREQUENCY DISTRIBUTION FOR THE INK-BLOT-TEST DATA


     (1)              (2)              (3)            (4)
Scores in the   Exact upper limit   Frequencies    Cumulative
  intervals      of the interval                   frequencies
    55-59             59.5               1             50
    50-54             54.5               1             49
    45-49             49.5               3             48
    40-44             44.5               4             45
    35-39             39.5               6             41
    30-34             34.5               7             35
    25-29             29.5              12             28
    20-24             24.5               6             16
    15-19             19.5               8             10
    10-14             14.5               2              2


We list the scores in the first column just as before, with high scores at the
top, giving in column 1 the score limits of the class intervals. We next want
a single score value to assign to each interval. Where before we used the
midpoint, now we choose the exact upper limit. The reason is that the
frequency to be given corresponding to it will be all the cases within the class
and below it. All those cases fall below the exact upper limit of the class. In
column 3 are given the ordinary frequencies and in column 4, the cumulative
frequencies. The cumulation is started at the bottom of the list in column 3.
Below the upper limit of the lowest interval (14.5) are two cases. Below the
upper limit of the second interval (19.5) are these two plus the eight in the
second interval, giving 10 as the cumulative frequency. In the third interval,
we find six cases to add onto what we already have, making 16 for the third
interval. And so it goes, each cumulative frequency being the sum of the
preceding one and the frequency in the class interval itself. This continues
until the last (top) interval is reached. The last cumulative frequency
should be equal to N (here it is 50); if not, some error has been made.

FIG. 6.1. A cumulative frequency distribution curve for the ink-blot test. (Horizontal
axis: scores, 0 to 60; vertical axis: frequencies.)
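
The cumulation is a running sum from the bottom interval upward, and it can be sketched in Python with itertools.accumulate from the standard library. The frequencies are those of the ink-blot data; the variable names are hypothetical.

```python
from itertools import accumulate

# Ink-blot data, listed from the bottom interval (10-14) to the top (55-59).
upper_limits = [14.5, 19.5, 24.5, 29.5, 34.5, 39.5, 44.5, 49.5, 54.5, 59.5]
freqs = [2, 8, 6, 12, 7, 6, 4, 3, 1, 1]

# Each cumulative frequency is the preceding one plus the interval's own f.
cum_freqs = list(accumulate(freqs))

# The check named in the text: the last cumulative frequency must equal N.
n = sum(freqs)
if cum_freqs[-1] != n:
    raise ValueError("cumulation error: last cumulative frequency is not N")

# Print with the top interval first, as the table is laid out.
for limit, f, cf in reversed(list(zip(upper_limits, freqs, cum_freqs))):
    print(f"{limit:5.1f} {f:3d} {cf:3d}")
```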

Plotting the Cumulative Distribution. Figure 6.1 shows the cumulative 
frequencies we have just obtained in Table 6.1, plotted against the corre- 
sponding scores (exact upper limits). The plotting here follows much the 
same routine as prescribed in Chap. 3, except that here we never plot the 
histogram form, only the type that connects neighboring dots with straight 
lines. Obviously we do not obtain a polygon but rather an S-shaped curve. 
In order to bring the curve to the base line at the left, we assume that a zero 
frequency comes at the lower limit of the bottom class interval (which is the 
same as the top of the interval just below it). As before, the total figure is 
about 60 to 75 per cent as high as it is wide. 

Determining Quartiles Graphically. It is of interest to point out here the
ease with which the quartiles can be graphically determined or read off the
curve in Fig. 6.1. To find the median (Q₂), we first locate the frequency of 25
(N/2) on the vertical axis. Draw a horizontal line over to the curve at this
level. At the point where it intersects the curve, drop a perpendicular to the
base line. Where this cuts the base line, read the score value. On ordinary
graph paper, Q₂ can be read accurately to one decimal place. Q₁ would be
similarly determined at the level of 12.5 on the frequency scale and Q₃, at the
level of 37.5.
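
Reading a quartile off a curve made of straight-line segments is the same as interpolating linearly between the two plotted points that bracket the required frequency. The Python sketch below does this for the ink-blot data; the function name score_at is hypothetical, and the first point assumes, as in the plotting instructions, a zero frequency at the lower limit of the bottom interval.

```python
# Points of the cumulative curve: (exact limit, cumulative frequency).
curve = [
    (9.5, 0), (14.5, 2), (19.5, 10), (24.5, 16), (29.5, 28), (34.5, 35),
    (39.5, 41), (44.5, 45), (49.5, 48), (54.5, 49), (59.5, 50),
]


def score_at(cum_freq, curve):
    """Score at which the straight-line curve reaches a cumulative frequency."""
    for (x0, c0), (x1, c1) in zip(curve, curve[1:]):
        if c0 <= cum_freq <= c1 and c1 > c0:
            return x0 + (cum_freq - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("cumulative frequency lies outside the curve")


n = curve[-1][1]
q1, q2, q3 = (score_at(n * p, curve) for p in (0.25, 0.50, 0.75))
print(round(q1, 1), round(q2, 1), round(q3, 1))
```

A graphical reading should agree with these values to about one decimal place, which is the accuracy the text expects from ordinary graph paper.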

Distribution of Cumulative Percentages and Proportions. Previously we
have had reason to transform frequencies into percentages for the sake of
comparing two distributions where N differs (Chap. 3). The same reason,
plus more important ones, prompts us more frequently to transform
cumulative frequencies into percentages. In Table 6.2, another example of
cumulative frequencies is given. They are obtained here (column 4) just as before.
We now wish to find what percentage of 86 each cumulative frequency is.
The arithmetic is simply a matter of multiplying each cumulative frequency
by 100/N. This fraction, 100/86, is equal to 1.1628. It is well here to keep a
liberal number of decimal places. In Table 6.2, the cumulative percentages
in column 5 are obtained by multiplying each frequency in column 4 by
1.1628. These need not be given to more than one decimal place.
Sometimes it is preferable to work in terms of cumulative proportions, which are
given in column 6. Whereas with percentages the base is 100, with
proportions the base is 1.00. Each proportion is therefore simply 1/100 of the
corresponding percentage. Thus, cp = .011628 × cf. The reason for using
proportions will be explained later; here we shall be concerned with percentages.


TABLE 6.2. CUMULATIVE FREQUENCIES, PERCENTAGES, AND PROPORTIONS FOR
MEMORY-TEST SCORES


  (1)         (2)           (3)     (4)       (5)          (6)
 Scores   X (exact upper     f      cf     Cumulative      cp
              limit)                            %
 41-43       43.5            1      86       100.0        1.000
 38-40       40.5            4      85        98.8         .988
 35-37       37.5            5      81        94.2         .942
 32-34       34.5            8      76        88.4         .884
 29-31       31.5           14      68        79.1         .791
 26-28       28.5           17      54        62.8         .628
 23-25       25.5            9      37        43.0         .430
 20-22       22.5           13      28        32.6         .326
 17-19       19.5            8      15        17.4         .174
 14-16       16.5            3       7         8.1         .081
 11-13       13.5            4       4         4.7         .047
  8-10       10.5            0       0         0.0         .000
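
Columns 4 to 6 of Table 6.2 follow mechanically from the frequencies in column 3. The Python sketch below, with hypothetical variable names, repeats the cumulation and then converts each cumulative frequency to a percentage and a proportion, rounded as in the table.

```python
from itertools import accumulate

# Memory-test frequencies, from the bottom interval (8-10) to the top (41-43).
freqs = [0, 4, 3, 8, 13, 9, 17, 14, 8, 5, 4, 1]

cum_freqs = list(accumulate(freqs))
n = cum_freqs[-1]                                       # 86 cases in all

percentages = [round(100 * cf / n, 1) for cf in cum_freqs]   # base 100
proportions = [round(cf / n, 3) for cf in cum_freqs]         # base 1.00

for cf, pct, cp in zip(cum_freqs, percentages, proportions):
    print(f"{cf:3d} {pct:6.1f} {cp:6.3f}")
```

Dividing by N directly, rather than multiplying by a rounded 1.1628, avoids carrying the rounding of that fraction into every row.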
The Cumulative Percentage Curve, or Ogive. In Fig. 6.2, the cumulative 
percentages we have just obtained in Table 6.2 are plotted as points against 
the corresponding score points (exact upper limits of class intervals). Again, 
an S-shaped curve results. Now that it is standardized as to height, it is
sometimes called an ogive. The ogive is, in other words, the cumulative
percentage distribution curve. Two ogives are much more readily compared
than two ordinary cumulative curves because of their common height. But 
this is not the only use of an ogive, as we shall soon see. 


CENTILE NORMS 


Finding Centile Points by Interpolation. A centile point (often called simply 
“centile” for the sake of brevity) is a value on the scoring scale below which falls any 
given percentage of the cases.² For example, the 90th centile is the point below 
which are 90 per cent of the scores, and the 24th centile is the point below 
which are 24 per cent of the scores.³ 

Deciles and Tenths. We have already seen how to interpolate in order to 
compute a median and other quartiles. Actually, the median is at the 50th 
centile, Q1 is at the 25th centile, and Q3 is at the 75th centile. It is but a step 
further to generalize this to any centile one desires. We could choose to 
interpolate any centile; the 63d, the 81st, or the 8th. Our interest in testing 
happens to stress the centiles that are multiples of 10—the 90th, 80th, 70th, 
etc., down to the 10th. These are called the deciles, for they divide the dis- 
tribution into tenths, just as the quartiles divide it into quarters and the 
median, into halves. 

The Process of Interpolation. The principle of interpolating is not new. 
Table 6.3 shows how we may work out the deciles systematically. The com- 
plete headings of the table make the work almost self-explanatory, but let us 
follow through one or two examples. First we need to know how many cases 
out of the total of 86 we need to include in any given percentage. Ninety 
per cent of 86 is 77.4, which we find in column 2. We must count up the 
scoring scale among the frequencies until we include 77.4 cases. Reference 
to Table 6.2 shows that we get by accumulation 76 cases up to the score point 
34.5. We need 1.4 more cases among the 5 in the next higher interval. 
There are three score units in the interval, and so we have to proceed 1.4/5 
times 3, or, as given in columns 4 and 5 of Table 6.3, we add to 34.5 the 
amount (1.4 × 3)/5, which gives 35.3 as the centile point. We say that P90 
(90th centile) equals 35.3. To take a second example, let us solve for P10. 
Ten per cent of 86 is 8.6. Counting up to a score point of 16.5, we find 7 
cases, which leaves us needing 1.6 more out of the 8 in the next interval. P10 


1 The ogive may also be in terms of cumulative proportions, since proportions and per- 
centages are used interchangeably. 

2 The term centile is often called (superfluously) percentile in the literature. There is 
about as much excuse for speaking of perdecile or of perquartile. 

3 The term centile, without reference to a scale of measurement, strictly speaking, should 
mean centile rank, which means a rank position among a hundred rank positions. When 
the term is used alone, the context will indicate whether centile rank or centile point is 
meant. 
TABLE 6.3. CALCULATION OF CENTILES, OR CENTILE POINTS, BY INTERPOLATION IN
THE MEMORY-TEST DATA

      (1)              (2)               (3)                  (4)                 (5)               (6)
  Centile rank      Number of     Cumulative frequency   Lower limit of       Distance of       The centile
  (percentage      cases below    below the interval     interval containing  centile point        point
   below the       the centile    containing the         the centile point    above lower
 centile point)       point       centile point                               limit
       90             77.4               76                   34.5            + (1.4 × 3)/5        35.3
       80             68.8               68                   31.5            + (0.8 × 3)/8        31.8
       70             60.2               54                   28.5            + (6.2 × 3)/14       29.8
       60             51.6               37                   25.5            + (14.6 × 3)/17      28.1
       50             43.0               37                   25.5            + (6.0 × 3)/17       26.6
       40             34.4               28                   22.5            + (6.4 × 3)/9        24.6
       30             25.8               15                   19.5            + (10.8 × 3)/13      22.0
       20             17.2               15                   19.5            + (2.2 × 3)/13       20.0
       10              8.6                7                   16.5            + (1.6 × 3)/8        17.1


is therefore equal to 16.5 + (1.6 × 3)/8, which equals 17.1. The remaining 
centile points are similarly determined and are listed in the last column of 
Table 6.3. 
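
The interpolation carried out column by column in Table 6.3 can be sketched as a small Python function. The names `centile_point`, `LOWER_LIMITS`, and `FREQS` are assumptions made for this illustration; the data are the exact lower limits and frequencies of the memory-test intervals in Table 6.2, lowest interval first.

```python
# Exact lower limits and frequencies of the class intervals in Table 6.2.
LOWER_LIMITS = [7.5, 10.5, 13.5, 16.5, 19.5, 22.5, 25.5, 28.5, 31.5, 34.5, 37.5, 40.5]
FREQS = [0, 4, 3, 8, 13, 9, 17, 14, 8, 5, 4, 1]
WIDTH = 3  # score units in each class interval


def centile_point(rank, lower_limits=LOWER_LIMITS, freqs=FREQS, width=WIDTH):
    """Interpolate the score point below which `rank` per cent of the cases fall."""
    n = sum(freqs)
    needed = rank / 100 * n      # column 2: number of cases below the centile point
    below = 0                    # column 3: cumulative frequency below the interval
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and below + f >= needed:
            # columns 4 and 5: lower limit plus the proportional distance
            return lower + (needed - below) * width / f
        below += f
    raise ValueError("rank must lie between 0 and 100")


for rank in range(90, 0, -10):
    print(rank, round(centile_point(rank), 1))
```

Rounded to one decimal place, the nine values agree with the last column of Table 6.3.
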

The Utility of Centile Norms. Test scores of various kinds are frequently 
interpreted in terms of centile norms, for very good reasons. In the first 
place, a raw score of so many points means very little to us. Tell a student's 
adviser that his advisee made a score of 59 points in an algebra-achievement 
examination, 175 points in an English-achievement examination, and 121 
points in a general scholastic-aptitude test, and without further information 
the adviser does not know whether his advisee is low in all tests, high in all 
tests, or low in one or two and high in the remaining. But tell him that a 
score of 59 points in algebra is at the 99th centile, the 175 points in English 
is at the 32d centile, and the 121 in scholastic aptitude is at the 48th centile, 
when those centiles were established by the scores from 1,500 freshmen enter- 
ing the university with the advisee in question; then he will have some usable 
information. The student in question is extremely high in algebra, moder- 
ately low in English, and about average in general scholastic aptitude. The 
chief utility of centile norms is (1) to give some conception of the general level 
of a score in a known population and (2) to put scores from different tests on a 
comparable basis. 

Finding Centile Norms by Interpolation. If we wished to have a table of 
centile norms for the memory test, we could now use the nine decile points 
already found by interpolation as they are listed in the last column of Table 
6.3. Then when a student came along with a score of 22 we could say that he 
is at the 30th centile; another student with a score of 30 is at the 70th centile, 
etc. When a score came up that is not exactly listed we could find its 
centile equivalent by interpolation. For example, a score of 21 would be at 
the 25th centile, and a score of 27 would be at about the 53d centile. 
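
Going the other way, from a score to its centile equivalent, is the same straight-line interpolation between neighboring decile points. The sketch below assumes the nine interpolated points of Table 6.3; the names `DECILE_POINTS` and `centile_rank` are illustrative only, and scores outside the tabled range are simply rejected.

```python
# Interpolated decile points from the last column of Table 6.3.
DECILE_POINTS = {10: 17.1, 20: 20.0, 30: 22.0, 40: 24.6, 50: 26.6,
                 60: 28.1, 70: 29.8, 80: 31.8, 90: 35.3}


def centile_rank(score, norms=DECILE_POINTS):
    """Interpolate the centile rank of a score between two tabled centile points."""
    ranks = sorted(norms)
    for lo, hi in zip(ranks, ranks[1:]):
        x_lo, x_hi = norms[lo], norms[hi]
        if x_lo <= score <= x_hi:
            return lo + (hi - lo) * (score - x_lo) / (x_hi - x_lo)
    raise ValueError("score lies outside the tabled decile points")


print(centile_rank(21))          # 25.0
print(round(centile_rank(27)))   # 53
```
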

Centile Norms from Smoothed Ogives. But there are objections to the 
use of interpolated centiles as norms. Chance irregularities in distribution 


FIG. 6.2. Smoothed cumulative distribution curve for the memory-test scores. Frequencies 
are in terms of percentages. (Vertical axis: cumulative percentages; horizontal axis: scores 
in a memory test.) 


from a small sample often give a distorted picture of the true situation that 
probably obtains in the larger population. After all, it is the larger popula- 
tion that we wish to represent in our norms, or at least we should like to com- 
pare future individuals’ scores with something more stable and general than 
our limited sample. For this reason the author strongly recommends that 
centile norms be set up in terms of the smoothed ogive. Interpolated norms 
are derived from the unsmoothed curve and, as was said, they are affected by 
minor irregularities that are probably a peculiarity of this sample only and 
not of the general population. The smoothed ogive may be taken as an 
estimation of the distribution of the general population of which our group is 
a sample. When a sample is large, very little smoothing is necessary. Even 
with small samples, at times surprisingly little smoothing need be done. 

In Fig. 6.2, a smoothed ogive (by inspection and freehand drawing) has 
been drawn. The aim is to bring it as close as possible to all points, and if 
points must be untouched by the curve, there should be about as many below 
the curve as above it. If too glaring discrepancies occur between points and 
curve after smoothing, it is probably best to discard the attempt to use these 
data as a basis for norms or else to add more cases until sampling irregularities 
are greatly reduced. 

Reading Centile Scores from a Graph. Having satisfied oneself as to the 
smoothed ogive, the next step is to read off the diagram the score points corre- 
sponding to the centile ranks for which norms are required. For this purpose 
the diagram should be enlarged sufficiently for easy reading and the graph 
paper finely ruled so that score points may be accurately read to one decimal 
place. In Table 6.4 are given the score points corresponding to centiles 10 to 
90, as before, but also to 95 and 99 at the upper end and to 5 and 1 at the 
lower end. The reason for including these extra points at the extremes is 
that there is actually a great range of ability above the 90th centile and also 
below the 10th centile. In fact, the range of ability is about as great beyond 
the 90th centile as it is between the mean and the 90th centile, and as great 
below the 10th centile as between that point and the mean, when the distribu- 
tion is normal. 

A Defect in Decile Scales. One defect of the centile scale, as a measuring 
scale, is that it exaggerates individual differences, relatively, near the center 
of the distribution as compared with those near the ends. Giving score 


TABLE 6.4. CENTILE NORMS FOR THE MEMORY TEST, DERIVED FROM THE
SMOOTHED OGIVE

 Centile    Score point    Integral score
   99          40.5             41
   95          37.1             38
   90          34.9             35
   80          31.8             32
   70          29.5             30
   60          27.9             28
   50          26.1             27
   40          24.3             25
   30          22.5             23
   20          20.4             21
   10          17.5             18
    5          14.9             15
    1          11.9             12


norms corresponding to selected centiles beyond 10 and 90 compensates for 
this defect to a large extent. Because of this same defect, it is not the best 
practice to work with decile norms, for to do so often leads the user of the 
norms to lay too much stress upon differences among the great average group 
and too little upon those where tests discriminate best. 
Figure 6.3 illustrates how a decile scale distorts differences along the scale. 
This figure is so drawn that the 10 decile divisions cover the same total range 
as the original scores. The heights of the rectangles are drawn so that the 
total area in the 10 categories combined is equal to that under the original 
curve. The new frequency distribution, when decile ranks are given equal 
distances on the measurement scale, is rectangular. It is as if we had pressed 
down upon the center of the original distribution, forcing the central indi- 
viduals farther apart, and to make up for it we group individuals who are 
spread over the tails of the original curve into narrower categories. 

Another illustration of the distorting effect of decile and centile scales when 
we give equal distances to numerically equal intervals is shown in Fig. 6.6. 
Here are shown parallel scales for the memory test. Corresponding centile 

FIG. 6.3. Showing the same sample distributed along a scale of scores (the unimodal, and 
perhaps normal, distribution) also along a scale of deciles (rectangular distribution). 
Horizontal axis: scales of ability (scores, also decile ranks). 


ranks and raw scores are connected by dotted lines. From this it will be 
seen, in another way, how raw-score distances near the center become rela- 
tively spread and how equal distances near the extremes are relatively con- 
densed when converted to centile-rank values. 

It is probably best that decile norms, as such, be consigned to the limbo of 
forgotten procedures. In their place the author recommends the use of a C 
scale, which will be described in a later chapter (Chap. 19). Centile norms 
will continue to be useful, but it is urged that they be constructed in a way 
that will give more correct impressions of scale positions, as will now be 
described. 

Integral Centile Points. Before doing that, however, a further word of 
explanation of Table 6.4 is in order. The last column of “integral scores” 
is merely a revision of the second column by way of rounding to whole num- 
bers. Tables of norms are frequently given in terms of whole numbers, 
mainly because scores are obtained as whole numbers. We should say that 
an obtained score of 41 is better than 99 per cent of the group can make, and a 
score of 18 is better than only the lowest 10 per cent can make. It should be 
noticed that every fractional score is rounded upward to the next whole number; 
thus 37.1 becomes 38. Since an obtained score of 37 covers a range of 36.5 
to 37.5, more than half of those making this score would not be better than 95 
per cent. The first score, counting from below upward, that is totally 
better than 95 per cent is a score of 38. This is why, in this and in other 
cases in this table, we round upward to the next higher integer. 
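
As a small check on the third column of Table 6.4, the upward rounding can be sketched with Python's `math.ceil`. The names `SCORE_POINTS` and `integral_norms` are assumptions made for this illustration; the score points are those of the second column of the table.

```python
import math

# Score points read from the smoothed ogive (second column of Table 6.4).
SCORE_POINTS = {99: 40.5, 95: 37.1, 90: 34.9, 80: 31.8, 70: 29.5, 60: 27.9,
                50: 26.1, 40: 24.3, 30: 22.5, 20: 20.4, 10: 17.5, 5: 14.9, 1: 11.9}


def integral_norms(points):
    """Round each fractional score point upward to the next whole number."""
    return {rank: math.ceil(point) for rank, point in points.items()}


for rank, score in integral_norms(SCORE_POINTS).items():
    print(f"{rank:3d}  {score}")
```
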

A Graphic Profile Chart. Many profile charts based upon centiles show 
graphically the deciles at equidistant levels along the scale. This gives an 


TABLE 6.5. THE DISTANCE OF CENTILES FROM THE MEAN IN NUMBER OF
STANDARD DEVIATIONS IN A NORMAL DISTRIBUTION

 Centile     Number of Sigmas
  Rank        from the Mean
   99            +2.33
   95            +1.64
   90            +1.28
   80            +0.84
   70            +0.52
   60            +0.25
   50             0.00
   40            -0.25
   30            -0.52
   20            -0.84
   10            -1.28
    5            -1.64
    1            -2.33


erroneous conception of the relative spacing of ability or talent, as was pointed 
out in a preceding paragraph. Actual differences in ability are probably 
more accurately indicated by the raw-score units than they are by centile- 
rank units, which relatively magnify the central portions of the distribution. 


FIG. 6.4. Showing on parallel scales standard scores and corresponding centile ranks. 
Since standard scores are given equal spacing, centile ranks have unequal spacing. Had 
centile ranks been given equal spacing, standard scores would have had unequal spacing. 
(Scales shown: centile ranks above; standard-score scale, from -2.0σ to +2.0σ, below.) 


If it is assumed that the actual distribution for the norm group is Gaussian, or 
normal, in shape, the relative spacing of the various centiles that we custom- 
arily include in our norms should be as given in Table 6.5. In the first 
column are the customary centile ranks. In the second column are the 
corresponding distances from the mean (and median) when the standard 
deviation of the distribution is adopted for convenience as the unit. The 
corresponding centile ranks and σ distances are also represented in Fig. 6.4. 
The correspondence of deviation from the mean with centile rank depends 


entirely upon the mathematical rela- 
tions that hold true for the normal 
distribution curve, and the reasons 
for this need not concern us here. 
The author merely proposes to use 
this spacing of the centile ranks in 
setting up a profile chart and has 
done so in Fig. 6.5. 
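
The σ distances of Table 6.5 can be reproduced from the inverse of the cumulative normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name `sigma_distance` is an assumption made here for illustration.

```python
from statistics import NormalDist

CENTILE_RANKS = [99, 95, 90, 80, 70, 60, 50, 40, 30, 20, 10, 5, 1]


def sigma_distance(rank):
    """Distance from the mean, in standard deviations, of a centile rank
    in a normal distribution with mean 0 and standard deviation 1."""
    return NormalDist().inv_cdf(rank / 100)


for rank in CENTILE_RANKS:
    print(f"{rank:3d}  {sigma_distance(rank):+.2f}")
```

Rounded to two decimal places, these values agree with the second column of Table 6.5.
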

Here, in Fig. 6.5, each centile is 
drawn at a distance from the mean 
proportional to its corresponding σ 
distance given in Table 6.5; i.e., cen- 
tiles 99 and 1 are 2.33 σ units from 
the mean, centiles 90 and 10 are 1.28 
units away, etc., though those dis- 
tances are not labeled numerically in 
the chart and need not be. Once 
having located them at the proper 
distances, we may forget the σ values. 

Provision has been made for four tests in the profile chart: the memory 
test whose norms we have determined in previous parts of this chapter; 
a vocabulary test; a word-building test; and a sentence-construction test 
whose norms were determined elsewhere. For the memory test, the 
integral scores have been written in at their corresponding centiles, being 
guided by the list of score points in column 2 of Table 6.4. Once the 
scores nearest those points are located and written in the diagram, the other, 
intervening scores can be introduced. The same was true for the other test 
norms though, because of crowding, some integral scores have been omitted. 
The student whose profile is shown earned raw scores of 28, 88, 20, and 23, 
respectively, in the four tests. Those four scores have been encircled and then 
connected with straight lines to complete the profile. We can now see the 
general trend of this student’s ability in these four tests taken together, and we 
can read off his centile rating in each test at a glance. Furthermore, a much 
more accurate conception of his fluctuation in ability is given than would have 
been true in a diagram with equidistant deciles. 

FIG. 6.5. An example of a profile chart based upon centile norms. Note that the centile 
ranks are not spaced at equidistant intervals, but at intervals based upon corresponding 
σ distances from the mean (see Table 6.5 and Fig. 6.4). 

Figure 6.6 shows how, if we had spaced the centile ranks at equidistant 
intervals, as is sometimes done, the corresponding separations on the score 


FIG. 6.6. Showing parallel scales of centile ranks and corresponding raw scores for the 
memory test. Here centile ranks are equally spaced on their scale, and raw scores are 
equally spaced on their scale. Equally spaced centile-rank intervals, however, correspond 
to very unequal raw-score intervals. (Upper scale: test scores, 10 to 40; lower scale: 
centile ranks, 0 to 100.) 


scale would have been very unequal in different parts of the scale. As a 
general principle, individuals are best discriminated by tests where they are 
spread thinnest in the distribution. 
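
The unequal raw-score intervals can be seen directly in the memory-test norms. The sketch below takes the decile score points of Table 6.4 and reports how many score units each successive step of 10 centile ranks covers; the names `SCORE_POINTS` and `decile_band_widths` are assumptions made for this illustration.

```python
# Score points read from the smoothed ogive (Table 6.4), keyed by centile rank.
SCORE_POINTS = {10: 17.5, 20: 20.4, 30: 22.5, 40: 24.3, 50: 26.1,
                60: 27.9, 70: 29.5, 80: 31.8, 90: 34.9}


def decile_band_widths(points):
    """Raw-score distance covered by each successive step of 10 centile ranks."""
    ranks = sorted(points)
    return {(lo, hi): round(points[hi] - points[lo], 1)
            for lo, hi in zip(ranks, ranks[1:])}


for (lo, hi), width in decile_band_widths(SCORE_POINTS).items():
    print(f"centiles {lo:2d}-{hi:2d}: {width:.1f} score units")
```

The steps near the middle of the distribution cover less than 2 score units, whereas the outermost steps shown cover about 3, so equal centile-rank steps do not stand for equal raw-score distances.
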


FIG. 6.7. A graphic device for visual comparisons of distributions, showing important 
centile values and total ranges. (Vertical axis: scores in an English examination; the 
highest and lowest scores are marked.) 


A Bar Diagram of Distributions of Scores. A useful graphic device for 
picturing distributions of scores is shown in Fig. 6.7. The bar diagrams 
there illustrate the distributions of three groups of students who were taught 
by three different instructors but who were given the same final examination, 


1 Similar diagrams have been used for some time by the Cooperative Test Service. 


an objectively scored achievement examination in English. The median of 
each group is marked by a short horizontal line through the bar at the 
median-score level. The range of the middle 50 per cent (from P25 to P75, or 
from Q1 to Q3) is shown in each case by the open rectangle. The black bars 
extend out to the points P10 and P90, in other words, to include the middle 
80 per cent of the cases. The lines extend to points at P5 and P95, or to 
include the middle 90 per cent of the cases. The highest and lowest single 
scores are marked by the small x's. Thus several meaningful centile points 
are labeled, as well as the entire range. 

Interpretation of Bar Diagrams. One important use of bar diagrams is the 
ready comparison of groups that they afford. In Fig. 6.7, for example, it is 
obvious that the three medians come in the order 1, 2, 3 for groups C, B, and 
A, respectively. The variabilities of the three groups come in the order B, 
C, and A when we depend upon total ranges. The groups come in almost the 
same rank order for variability when we compare ranges of middle 90 per 
cent, but again the order B, C, A is probably correct in comparing middle 50 
per cents, though B and C are very close together in this respect. As to 
topmost scores, they come in the same order as for medians, C, B, A, but for 
bottom scores the order is A, C, B. As to skewness, the most symmetrical 
distribution, all things considered, is probably that for group B, and the least 
symmetrical is for group A, which is positively skewed. The special virtue 
of this kind of comparison, as contrasted with that afforded by means of fre- 
quency polygons and ogives, is that many more facts about a distribution can 
be recorded, and yet because of no overlapping of the drawings there is direct 
comparison without confusions. 


Exercises 


1. Carry through the following steps for the first distribution of chemistry-aptitude 
scores in Data 3C (Chap. 3). 
a. Find the cumulative frequencies, and tabulate them. 
b. Plot a cumulative distribution curve similar to Fig. 6.1. 
c. Find the cumulative percentages and proportions, and tabulate them. 
d. Plot the ogive distribution, showing the smoothed curve. 
e. Compute the interpolated centiles that divide the distribution into tenths. 
f. Derive centile norms from the smoothed ogive, and set up a table of norms. 
g. Prepare a centile profile chart including the norms for this test and for one or two 
others for which you have data. 
2. Repeat the steps, particularly a, c, d, and f, for any other distribution of test scores. 
3. Prepare bar diagrams like those in Fig. 6.7 for comparing two or more distributions, 
such as the two in Data 3C, or Data 4F (Chap. 4). 




Answers 
1. a. cf: 266; 262; 252; 238; 219; 187; 156; 116; 88; 59; 38; 20; 10; 4; 3. 
c. cP: 100.0; 98.5; 94.7; 89.5; 82.3; 70.3; 58.6; 43.6; 33.1; 22.2; 14.3; 7.5; 3.8; 1.5; 1.1. 


e. Decile points: 80.0; 73.5; 69.4; 64.8; 61.6; 57.8; 53.1; 48.1; 41.3. 
f. Integral centile-norm scores: 93; 85; 80; 74; 70; 66; 62; 58; 54; 49; 42; 37; 29 


CHAPTER 7 


THE NORMAL DISTRIBUTION CURVE 


Repeatedly have sets of measurements in psychology and education yielded 
frequency distributions that resemble the bell-shaped normal, or Gaussian, 
curve. Because the normal curve has so many useful mathematical properties,
it is quite natural that we should exploit those properties in dealing with 
psychological and educational data. Without the use of the Gaussian curve 
and its convenient characteristics, many things that we now do with data 
would otherwise be impossible. It is important, therefore, that the student 
develop at least a moderate understanding of the normal curve in order that 
he may wisely apply the statistical procedures that depend upon it. 

Normality of Distribution Is Assumed. It must be confessed at the outset 
that no set of data ever obtained, whether they be measurements of a group 
of individuals with respect to some biological, psychological, social, or educa- 
tional trait or whether they be repeated observations of a single phenomenon, 
ever conforms exactly to the normal distribution pattern. Even though the 
larger population from which our sample came is perfectly normally dis- 
tributed (even this is probably never strictly true), sampling, no matter how 
extensive or representative it may be, is bound to give us some irregularities, 
with deviations from the normal form. Whenever, therefore, we treat our 
data as if they were normally distributed, or arose from a population that is 
normally distributed, we are assuming an ideal pattern for the sake of 
simplicity, rationality, and convenience. Sometimes we are more justified 
and sometimes less; we can never be absolutely sure, because the entire 
population is rarely or never measured, and the true shape of distribution is 
never known. 

We can justify our assumption of normality in several ways. One is the 
rational approach, which attempts to point out that the phenomenon we are 
measuring results from a number of independent causes occurring in chance 
combination, as in the tossing of coins or in the combinations of nonlinked 
hereditary genes. Very rarely is this kind of argument possible, because of 
our ignorance of underlying causes. Another kind of approach is empirical, 
in which we can show that, with the use of the measuring scale that we did 
use, the grouped data present a frequency distribution that obviously possesses 
a bell-shaped contour. Furthermore, there are statistical tests that can be 
applied to show whether or not the frequencies we obtained deviate so much 
from the normal-curve picture as to cause us to reject our hypothesis that the 
data came by random sampling from a normally distributed population. 
Two Reasons for Caution. There are two considerations, however, which 
should cause us to pause before making the hypothesis, or assumption, of 
normality. One has to do with the question of sampling and the other with 
the question of the correctness of our measuring scale. A population may 
well be normally distributed, yet because of our method of drawing cases for 
measurement we may obtain a skewed or otherwise distorted form of distribu- 
tion. This is a case of biased sampling. A large population of ten-year-old 
children would probably be distributed normally when measured for mental 
age. But if we confine ourselves to ten-year-old children in the fourth grade 


FIG. 7.1. Showing how a test at three different levels of difficulty (A, B, and C) may yield 
distributions of raw scores differing markedly in skewness, regardless of the form of 
distribution of ability in the population. 


only, where most ten-year-olds are probably present because of mental 
retardation and a few for other reasons, the distribution of mental ages would 
be positively skewed. The ten-year-olds in the sixth grade would probably 
yield a negatively skewed distribution, for the majority of them are acceler- 
ated by reason of precocity and a few for other causes. Both are cases of 
biased sampling. An unbiased, representative sampling would not confine 
itself to fifth-grade children, but would take ten-year-olds in correct ratios 
from all grades where they appear, would take them in correct proportions as 
to sex, economic status, and other factors considered significant. 

When a test or examination is used as the measuring instrument, the form 
of distribution of scores will depend upon many factors other than the form 
of distribution of the population. One of these factors is the level of difficulty 
of the test relative to the level of ability of the population. Even if the popu- 
lation is normally distributed in the ability measured, unless the test is of an 
appropriate level of difficulty a normal distribution of scores in a sample will 
not be obtained. If the test is too difficult, the distribution will be positively 
skewed, like that labeled A in Fig. 7.1. If the test is of moderate difficulty 
for the group, a symmetrical distribution like that labeled B will occur. If 
the test is too easy for the group examined, the distribution will be negatively 
skewed, like C in Fig. 7.1. Other degrees of skewing might occur. The 
effect of skewing, when we are sure that the correct form of distribution should 
be symmetrical, may be regarded as a systematic distortion of the scale of 
measurement. The too difficult test tends to make the numerical units 
among the low scores stand for relatively large intervals of ability, and the 
too easy test to make the units among the high scores also stand for rela- 
tively large intervals. This principle should be clear from a study of Fig. 7.1. 

Other factors than difficulty may distort sample distributions. Later 
(Chap. 17) it will be shown how degree of reliability of scores may affect the 
form of distribution, causing tendencies toward sharpness of the rise in the 
center versus flatness, tendencies toward bimodality, and even U-shaped 
distributions. Another distorting factor may be the unsuitability of the scale. 
As was pointed out in an earlier chapter (Chap. 4), work-limit scores and 
time-limit scores tend to be reciprocals of each other. If the one kind of 
score in a task is normally distributed, the other will probably not be. 

These cautions kept in mind should serve to inhibit dogmatic assertions 
that might otherwise be made about the shape of a distribution. The shape 
of a distribution is always a function of the kind of measuring scale, and all 
conclusions that involve form of distribution should take this fact into 
account. The conviction that general populations are genuinely normally 
distributed with respect to most qualities is very strong, however, and so it is 
usually the marked deviation from normality in a sample that arouses ques- 
tions. We may then question either our method of sampling or our measur- 
ing scale. One or both of these factors may be responsible for the dis- 
crepancy. But when our sample distribution turns out reasonably normal in 
appearance, because of the conviction just mentioned we may feel some 
assurance that our sampling and our measuring scale are probably free from 
distortions, though of course we can never be certain of this. The conviction 
does lead us to apply the Gaussian curve in many useful ways, even in turning 
obtained scores into normally distributed measurements, as we shall see later 
(Chap. 19). We frequently feel that the risk in making the normal assump- 
tion is well worth while because of the invaluable results and conclusions it 
affords. We can always state our conclusions with the reservation that they 
are true to the extent that our assumptions are valid. As a matter of fact, all 
other conclusions should be couched in similar terms, for none is without its 
foundation of assumptions of one kind or another, whether stated or not. All 
scientific conclusions rest on assumptions, in the final analysis, and he who 
would know the import of those conclusions best is the one who knows those 
assumptions best. 


THE NATURE OF THE NORMAL CURVE 


The Relation of the Normal Curve to Probability. The Gaussian curve is 
also sometimes called the normal probability curve and is said to be the result 
of the “laws of chance.” In a sense, this is true. We cannot here go into an 
involved discussion of probability and of the way in which the Gaussian 
curve is logically related to probability. It is sufficient for our present pur- 
poses to point out the usual example of how a normal distribution can be 
approximated by means of coin tossing. If we thoroughly shake a set of six 
coins and toss them to land where and how they may, the result can turn out 
in seven different ways; the number of heads can vary all the way from 0 to 6. 
In a total of 64 tossings, according to the principles of probability, we should 
expect the following frequencies for various numbers of heads: 

Number of heads         0    1    2    3    4    5    6
Expected frequencies    1    6   15   20   15    6    1


If we tossed the six coins twice as many times, we should expect these fre- 
quencies to be doubled. Actually obtained frequencies will deviate from 
these expected ones by small amounts. In one such experiment with 128 
tosses, the obtained frequencies were as given here: 


Number of heads         0    1    2    3    4    5    6
Obtained frequencies 
Expected frequencies    2   12   30   40   30   12    2


This situation is shown graphically in Fig. 7.2, where the obtained frequencies 
furnish the basis for the histogram and the expected frequencies furnish the 
basis for the superimposed normal curve. 
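
The expected frequencies for the six-coin example follow from the binomial coefficients, and a short sketch makes the arithmetic explicit. The function name `expected_head_counts` is an assumption made for this illustration, and the coins are assumed to be fair.

```python
from math import comb


def expected_head_counts(n_coins, n_tosses):
    """Expected frequency of 0, 1, ..., n_coins heads in n_tosses throws of
    n_coins fair coins."""
    total = 2 ** n_coins   # number of equally likely head-tail arrangements
    return [n_tosses * comb(n_coins, k) / total for k in range(n_coins + 1)]


print(expected_head_counts(6, 64))    # 64 tossings of six coins
print(expected_head_counts(6, 128))   # twice as many tossings
```

With 64 tossings each frequency is simply the number of arrangements giving that many heads; with 128 tossings every expected frequency is doubled.
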
FIG. 7.2. A distribution curve representing the frequencies with which various numbers of 
heads are expected by chance in tossing six coins. Also shown, in histogram form, a 
frequency distribution of the obtained data from 128 tossings of six coins. 


A six-coin problem gives us a seven-sided frequency polygon (not counting 
the base line). A 10-coin problem gives us an 11-sided contour, etc., the 
number of sides being equal to the number of coins plus one. If we do not 
enlarge the base line of our distribution but keep subdividing it into smaller 
and smaller units as we increase the number of coins, the contour of the 
distribution curve approaches the smooth bell form. The number of class 
intervals we choose in grouping obtained measurements has nothing to do 
with the number of coins, our choice being entirely arbitrary. The class 
intervals and their frequencies merely give us descriptions of the contour at 
points along the way. If there are things like coins in the phenomenon we are 
measuring (i.e., “coins” such as genes, which may be present or absent, or 
such as responses that do or do not occur) we almost always lack information 
as to how many such “coins” are operating. Probably there are a great 
many, although even if there were only six, as in the coin example, and if our 
measurements naturally fell therefore into seven class intervals, the normal 
distribution could still be roughly approached, as can be seen in Fig. 7.2. 
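
The expected frequencies in a coin problem of any size follow from the binomial expansion. As a minimal sketch in Python (the function name is ours, and the coins are assumed to be fair and independent), they can be computed as follows:

```python
from math import comb

def expected_head_frequencies(n_coins, n_tosses):
    """Expected number of tosses showing 0, 1, ..., n_coins heads.

    Assumes fair, independent coins, so each count of heads k has
    probability C(n_coins, k) / 2**n_coins.
    """
    total_outcomes = 2 ** n_coins
    return [comb(n_coins, k) * n_tosses / total_outcomes
            for k in range(n_coins + 1)]

# Six coins tossed 128 times, as in the experiment described above.
print(expected_head_frequencies(6, 128))
```

The list always has one more entry than there are coins, which is the "number of coins plus one" noted above, and doubling the number of tosses doubles every expected frequency.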

The Equation for the Normal Curve. Mathematically, when we are deal- 
ing with the properties of the normal curve, it is the situation with an infinite 
number of “coins” that we suppose. This enables the mathematician to give 
to the curve an equation that describes the relationship of a frequency to its 
corresponding measurement. This equation reads

$$ Y = \frac{N}{\sigma\sqrt{2\pi}}\, e^{-x^{2}/2\sigma^{2}} $$ (Equation for the Gaussian, or normal, curve) (7.1)

where Y = frequency
N = number of measurements
σ = standard deviation of the distribution
π = 3.1416
e = 2.718 (the base of the Napierian system of logarithms)
x = deviation of a measurement from the mean (or X − M)

Since the values for π and e are known, if we substitute them in the equation,
it becomes

$$ Y = \frac{N}{2.5066\,\sigma}\,(2.718)^{-x^{2}/2\sigma^{2}} $$


For any distribution we may have at hand, we know the values for N and for
σ, and these can be inserted in their places in the equation. The equation
would then be in a form with only Y and x the unknowns. We could then
assign certain values to x, within the range of our measurements, and solve the
equation for the corresponding values of Y. In this way, we could determine
the entire normal distribution curve that best fits our data. The arithmetical
work would be rather laborious. Fortunately, we have the use of statistical
tables to aid us in this. Table B, in Appendix B, is one well suited to this 
purpose. 
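
Where a computer is at hand, the laborious arithmetic can be done directly. The following is a minimal sketch in Python of equation (7.1); the function name is ours, and it is offered only as an illustration of the substitution just described, not as a replacement for the tabled values.

```python
import math

def normal_frequency(x, n, sigma):
    """Y of equation (7.1) for a deviation x from the mean.

    n is the number of measurements and sigma the standard deviation.
    """
    coefficient = n / (sigma * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-x ** 2 / (2.0 * sigma ** 2))

# With N = 1 and sigma = 1 the function gives the unit normal ordinate.
print(round(normal_frequency(0.0, 1, 1.0), 4))
```

Assigning a series of x values within the range of the measurements and calling the function for each one traces out the whole curve.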
Determining the Best-fitting Normal Distribution for a Set of Data. For
the sake of an illustration that will help us to appreciate the meaning of the 
normal curve, let us find the expected frequencies in a particular instance, a 
distribution of 86 scores in a memory test. The best-fitting normal curve for 
any set of data has the same mean and standard deviation as those computed 
from the actual data. The distribution of obtained frequencies of
memory-test scores is given in column 7 of Table 7.1. The mean of this distribution is
26.1, and the standard deviation is 6.45. Our task is to find the frequencies to
be expected in the same class intervals for a normal distribution with a mean
of 26.1, a standard deviation of 6.45, and an N of 86.

Standard Measurements or Scores. In order to use equation (7.1) to find 
these frequencies, we must know how far each class interval deviates from the 
mean in terms of standard deviations. Each interval is given the value of its 
midpoint as its point on the score scale X. These X values are listed in 
column 2 of Table 7.1. Note that we have included one class interval beyond 


TABLE 7.1. OBTAINING THE EXPECTED FREQUENCIES f_e IN THE CLASS INTERVALS
FOR THE MEMORY TEST, ON THE ASSUMPTION THAT THE TRUE DISTRIBUTION
IS NORMAL

[Table body not recoverable. Column (6) holds the expected frequency f_e and column (7) the observed frequency f_o.]


Each column of numbers is derived from the one preceding by the following computa- 
tions (see text for explanations): 

Column 3: x = X − 26.1.

Column 4: z = x/6.45.

Column 5: y comes from Table B.

Column 6: f_e = 40 × y.

the range of obtained scores at each end of the distribution. This is because
the best-fitting normal curve usually has some small frequencies (perhaps 
fractional) in those extreme positions, even though the obtained frequencies 
there are zero. The equation for the normal curve calls for deviations rather 
than original scores; in other words, for X − M, or small x, for each class
interval. These are listed in column 3. In this problem, each one is found 
by the solution of X — 26.1 for every interval. A simple check is to see that 
each one is three units (the size of the interval) distant from its immediate 
neighbors. The next step involves a new process, the determination of the 
standard measurement or standard score, for every interval. The standard 
score is given by the formula 


$$ z = \frac{x}{\sigma} = \frac{X - M}{\sigma} $$ (A standard score or measure) (7.2)

In the equation for the normal curve, it will be seen that the exponent of e,
which is −x²/2σ², can be written −(½)(x/σ)², or, in other words, it is minus
one-half times the standard score squared. We shall find the standard score
invaluable again and again. The statistical tables are constructed on the 
basis of standard scores. It matters not, then, what our original means and 
standard deviations are numerically. Reducing all raw scores to standard 
scores places them all on the same basis or common denominator. For our 
illustrative problem, the standard scores are given in column 4 of Table 7.1. 
Each number in column 4 is obtained by dividing the corresponding number 
in column 3 by 6.45, the standard deviation. 

Determining Frequencies for the Class Intervals. Having obtained the 
standard score for each class interval, we are now ready to look up the corre- 
sponding ordinate in the general statistical table, Table B. These are listed 
in column 5 of the worktable. The ordinates in this table are not exactly the 
frequencies we have been wanting to find. Those frequencies also depend 
upon N [see equation (7.1)]. Table B is constructed on the assumption that
N = 1 and σ = 1. For our distribution of 86 cases and a different σ, we
must make a certain adjustment. We must multiply each y value by a certain
number to find the expected frequency f_e. The general formula is


$$ f_e = \left(\frac{iN}{\sigma}\right) y $$ (Expected frequency in a best-fitting normal distribution) (7.3)


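In this problem, i = 3, N = 86, and σ = 6.45, so that the multiplier is

$$ \frac{iN}{\sigma} = \frac{(3)(86)}{6.45} = 40 $$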

When this multiplier is used with the numbers in column 5, the frequencies
we desired are finally forthcoming, and they are given in column 6.

Formula (7.3) may be made to appear reasonable if we look at it in the
following manner. The expected frequencies (f_e) must be of the order of
magnitude of the obtained frequencies (f_o). The sum of the obtained
frequencies is, of course, equal to N. The expected frequencies are, therefore,
proportional to N, as formula (7.3) states. They must also be proportional 
to the size of class interval (i) because the larger the size of interval, the 
smaller the number of them, and, since they add up to N, the larger each
frequency is. The appearance of σ in the denominator is not quite so easily
explained. It is best explained when we consider the equation for the normal
curve. Ignoring the expression involving e (with its exponent) in equation
(7.1), we find that Y is proportional to N/(σ√(2π)). When we let both N and
σ equal 1, as is the case in the tables on the normal curve, y is proportional to
1/√(2π). From this we see that the ratio of Y to y is equal to N/σ. Thus,
from another approach we can account for the presence of σ in formula (7.3)
as well as the presence of N.
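
Formula (7.3) can be sketched in a few lines of Python. The function names are ours, and the unit normal ordinate is computed from its equation rather than read from Table B, which is an assumption of convenience; values may therefore differ from tabled ones in the last decimal place.

```python
import math

def unit_normal_ordinate(z):
    """Ordinate y of the normal curve with N = 1 and sigma = 1."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def expected_frequency(score, mean, sigma, n, interval):
    """f_e = (iN/sigma) * y for the class interval whose midpoint is `score`."""
    z = (score - mean) / sigma            # standard score, formula (7.2)
    multiplier = interval * n / sigma     # iN/sigma, formula (7.3)
    return multiplier * unit_normal_ordinate(z)

# Memory-test constants from the text: M = 26.1, sigma = 6.45, N = 86, i = 3.
# A point exactly at the mean has z = 0, so f_e is 40 times .3989.
print(round(expected_frequency(26.1, 26.1, 6.45, 86, 3), 2))
```

Applying the function to each midpoint in column 2 would reproduce column 6, and summing the results gives the rough check against N.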

Comparing Obtained and Theoretical Frequencies. As a rough check upon 
all the work, we sum the expected frequencies, and the result should be very 
close to N but will usually be slightly less than N, because in the normal curve
there are still fractions of frequencies even beyond the limits we have included 
here. Had we not gone one class interval beyond the obtained data, we 
should have lost .2 of a frequency at the upper end and .5 at the lower, and 
the sum would have been 85.2 instead of 85.9. As it is, we have still lacking 
only .1 of a case; not enough to worry about, and we may accept our check as 
one indication of correct work. A comparison of expected with obtained 
frequencies is always a rough check but is very rough, because we expect 
small discrepancies within class intervals. Looking down the columns, we 
find only one or two serious discrepancies. One is the difference between 
15.1 and 9, and the other is between 1.5 and 4. Both the obtained frequencies
of 9 and 4 are out of line but are probably merely chance discrepancies, com- 
ing under the heading “errors of sampling,” and are no more serious than may 
be expected in a coin-tossing experiment. 

Plotting the Best-fitting Normal Curve. We could now use the expected 
frequencies as the basis of plotting the best-fitting, smooth, normal distribu- 
tion curve for the memory-test data. If plotting such a curve is our only 
objective, however, we have done some unnecessary work. A shorter pro- 
cedure for locating enough points for drawing the smooth best-fitting curve 
will now be explained. It follows precisely the same principles laid down in 
the previous discussion. But instead of being tied down to class intervals 
and their midpoints for our x values, we instead arbitrarily choose standard 
scores at convenient values .5σ apart, as in the first column of Table 7.2.


1 The customary way of determining whether the discrepancies between theoretical and
obtained frequencies are so large as not to be attributed to sampling errors is to employ
the chi-square test (see Chap. 11). The chi-square test, as applied to the normal-curve
hypothesis, enables us to arrive at a decision as to the probability that an obtained set of
frequencies is not normally distributed.

Since they are simple numbers, no interpolation will be necessary in using
Table B. Since the positive standard scores duplicate the negative ones,
half the work of looking up y values is obviated, unless one wishes to repeat
the process as a check. The expected frequencies are again found by
multiplying y by iN/σ, in this case, by 40. As before, this step is for the sake of
obtaining frequencies in the proportions comparable with those obtained for a
particular N (86), a particular σ (6.45), and a particular size of class interval
(3).

The frequencies found in this manner will not correspond to midpoints of 
class intervals, however, but to other score-point positions on the scale. 
These points will be .5σ apart, starting at the mean and going both ways.
They correspond to the z scores given in the first column of Table 7.2. We 
need to find the corresponding X values for these z values. The first step 


TABLE 7.2. OBTAINING THE BEST-FITTING NORMAL CURVE FOR THE DATA ON THE
MEMORY TEST FOR THE PURPOSE OF PLOTTING THE CURVE

[Table body not recoverable. Column (3) holds the expected frequency.]


The numbers in the columns are obtained as follows: 
Column 1: Arbitrarily chosen. 

Column 3: 40 × y.

Column 4: 6.45 × z.

Column 5: x + 26.1.


is to find the corresponding x deviations by the formula 
$$ x = z\sigma $$ (A deviation derived from a standard score) (7.4)


These are shown in column 4 of Table 7.2. The X points corresponding to x
deviations can be found by the formula

$$ X = M + x $$ (A measurement estimated from a deviation) (7.5)


which, in this problem, is X = 26.1 + x. The X values we want are shown
in the last column of Table 7.2. 
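
The shorter procedure can be sketched in Python as below. The constants are those of the memory-test problem; the function names, and the choice of carrying z from −3.0 to +3.0, are our assumptions for illustration, since any convenient range of standard scores will do.

```python
import math

MEAN = 26.1      # mean of the memory-test scores
SIGMA = 6.45     # standard deviation
N = 86           # number of cases
INTERVAL = 3     # size of class interval

def unit_ordinate(z):
    """Height of the unit normal curve at z (stands in for Table B)."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def plotting_points(step=0.5, z_limit=3.0):
    """Rows of (z, y, f_e, x, X) for z running from -z_limit to +z_limit."""
    multiplier = INTERVAL * N / SIGMA      # iN/sigma, 40 in this problem
    steps = int(round(z_limit / step))
    rows = []
    for k in range(-steps, steps + 1):
        z = k * step
        y = unit_ordinate(z)
        x = SIGMA * z                      # formula (7.4)
        rows.append((z, y, multiplier * y, x, MEAN + x))   # formula (7.5)
    return rows

for z, y, f_e, x, X in plotting_points():
    print(f"{z:+.1f}  {y:.4f}  {f_e:6.2f}  {x:+6.2f}  {X:6.2f}")
```

Each printed row gives one point (X, f_e) through which the smooth curve is to be drawn.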

Having these score points and their corresponding frequencies, we can
construct the graph shown in Fig. 7.3. The observed frequencies (f_o) are
also plotted as circlets to show where they fall with respect to the best-fitting 
normal curve. The reasonableness of the fit is rather obvious. It would 


Fig. 7.3. The best-fitting normal-distribution curve for the memory-test data
(frequencies plotted against scores). Obtained
frequencies are represented by circlets. The normal curve is “best-fitting” in the sense
that it has the same mean and standard deviation as the obtained distribution.


probably have been not so easy to duplicate this normal curve by the smooth- 
ing process recommended in Chap. 3. We may say by way of general conclu- 
sion that if our obtained mean and standard deviation approximate closely 
the mean and ø of the population from which our sample came, and if the 
distribution for the population is normal, it looks like the curve in Fig. 7.3. 


AREAS UNDER THE NORMAL CURVE 


Perhaps the greatest usefulness of the normal curve lies in the relationship 
of the amount of area under the curve lying between certain limits on the 
base line. In terms of mental-test scores, for example, this simply means the
number or percentage of the cases to be expected between two score points. 
This is because the area under the curve represents the number or percentage 
of cases. The total area is equal to N, the total number of cases. But if we
think in terms of a standard curve where N = 100, we can readily deal with 
percentages. For example, 50 per cent of the surface lies above the mean and 
50 per cent below. We can also think in terms of a standard curve whose 
total surface is equal to 1, or unity. In this instance we deal with propor- 
tions. The proportion of the area, or cases, lying above the mean is .5 and 
the proportion below is .5. The statistical tables are given in terms of a total
area of 1, and the areas of certain segments are listed as proportions, but it is
just as easy to talk in terms of percentages. A percentage is a proportion
multiplied by 100, and a proportion is a percentage divided by 100. Thus .46
of the surface is 46 per cent; and 72 per cent of the cases is .72 of the surface, 
etc. 

Proportion of the Area between the Mean and Some Measurement or 
Score. We have already had occasion to say that the interval extending one 
standard deviation on either side of the mean includes about two-thirds of the 
cases. To say the same thing in another way, from the mean to +1σ are
to be expected about one-third of the cases, and from the mean to −1σ,
another one-third of the cases. We can verify this by referring to Table B
and looking up the proportion of the area between the mean and 1σ (i.e., a z
equal to 1.00). The area given to four decimal places is .3413, or 3,413 ten- 
thousandths of the area. If there were a normal distribution with 10,000 


Fig. 7.4. Different percentages of area under the normal curve within the various
standard-deviation units on the base line.


cases, 3,413 of them would be expected between the mean and 1σ. In terms
of percentage, it would be 34.13 per cent, or 34.13 cases in 100. The total
interval from +1σ to −1σ contains twice this area, or .6826, or 68.26 per cent.
Figure 7.4 illustrates these facts graphically. We now see that this is a little 
more than two-thirds (which would be 66.67 per cent), but with small devi- 
ations from normality occurring on every hand we can afford to be so rough 
with our expectations as to give it as two-thirds. 

From Table B, we can also see that between the mean and a point 2σ distant
(either above or below, i.e., either +2σ or −2σ), we should expect .4772 of
the total surface, or 47.72 per cent of the cases. Included in the range from
−2σ to +2σ, we should find twice this proportion, or .9544 of the area, or
95.44 per cent of the cases. Out to 3σ from the mean extends .4987 of the
area, and in both directions from the mean to 3σ we find twice this, or .9974 of
the area. Only 26 cases in 10,000 (10,000 − 9,974), therefore, should be
expected beyond the range from −3σ to +3σ in a large sample.
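
These tabled areas can be checked by computation. The sketch below, in Python, uses the error function of the standard library in place of Table B; the function name is ours, and the relation area = ½ erf(|z|/√2) is the standard one for the unit normal curve.

```python
import math

def area_mean_to_z(z):
    """Proportion of the area under the unit normal curve between the mean and z."""
    return 0.5 * math.erf(abs(z) / math.sqrt(2.0))

for z in (1.0, 2.0, 3.0):
    one_side = area_mean_to_z(z)
    print(f"z = {z:.0f}: mean to z = {one_side:.4f}, -z to +z = {2 * one_side:.4f}")
```

Because only the numerical size of z enters the computation, the same function serves for points above and below the mean.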

To take another example of a less special nature, how much of the area 
under the normal curve will be found between the mean and +0.78σ? From
the table, we find this to be .2823. In still another problem, how many cases
lie between the mean and −1.47σ? From the table, we find this to be .4292.
Figure 7.5 illustrates these two cases. It will be seen that the positive or
negative sign of z merely tells us whether the area extends above the mean or 
below. The numerical size of z, whether positive or negative, determines the 
amount of area between the mean and the point. 

So far we have begun each problem of this type with some particular z or 
standard measurement. Let us start the problem a step or two further back 
and begin with some raw score or measurement. In the more practical case, 
we begin with X, not z. In the memory-test data, we may inquire what
proportion of the cases come between the mean (26.1) and a point of 35 on
the scale of measurement. This point deviates 8.9 units from the mean 


Fig. 7.5. Proportions of the total area under the normal curve within certain
standard-score limits on the base line.


(X − M = +8.9). This is the deviation x. The standard score z is x/σ,
which equals 8.9/6.45 = +1.38. Everything must be transformed into standard
measure before the probability table may be utilized. Entering the table
with a z of 1.38, we find the corresponding area to be .4162. In other words, 
41.62 per cent of the cases in a normal distribution would be found between 
the mean and 35 points on the scale. In the memory-test data, 41.62 per 
cent of 86 is 35.8, or, in whole numbers, 36 cases. In a similar manner, which
the student should verify, between the mean and a score of 20 are .3276 of the 
cases, or approximately 28. Between the mean and 15 are about 39 cases of 
the 86, and if we go on down to a score point of 5, we find 49.95 per cent of the 
cases. 
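
The student who wishes to verify these figures by computation may use a sketch such as the following, in Python. It relies on `statistics.NormalDist` from the standard library in place of Table B; the function name is ours. Since the text rounds z to two decimals before entering the table, results may differ slightly in the third or fourth decimal place.

```python
from statistics import NormalDist

MEAN, SIGMA, N = 26.1, 6.45, 86   # memory-test constants from the text

def proportion_mean_to_score(score, mean=MEAN, sigma=SIGMA):
    """Proportion of cases expected between the mean and a raw score."""
    return abs(NormalDist(mean, sigma).cdf(score) - 0.5)

for score in (35, 20, 15, 5):
    p = proportion_mean_to_score(score)
    print(f"score {score}: proportion {p:.4f}, cases {N * p:.1f}")
```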

Special interest attaches to the question of the proportion of cases between 
the mean and a score of 30.45. It will be found that the standard score cor- 
responding to this is 0.6745. From the table we find that the proportion of 
the area to this point is .25, or exactly one-fourth. This case is illustrated in 
Fig. 7.6. In short, the point at 0.6745σ corresponds to a distance of 1Q from
the mean.

The Area above or below a Certain Point on the Scale. For a given deviate 
or standard score, Table B also gives us the proportion of the area above a 
certain point on the scale or below it. Above a point at +1σ will be found
.1587 of the area. This is found in column C of Table B, because when a
vertical line is erected at +1σ (see Fig. 7.7) it divides the total area under the
curve into two portions, the one above the line being the smaller of the two.
Below the point +1σ is the remainder of the area, or the larger portion (found
in column B of the table), including .8413, or 84.13 per cent of the area. If
we were interested in the point −1σ, the larger portion under the curve is now
above the point of division and is found in column B, whereas the portion


Fig. 7.6. Proportions of the cases to be expected between certain score limits in the
memory-test data, on the assumption that the distribution is normal. (Score scale:
15.0, 26.1, 30.45; standard scale: −1.72σ, 0, +0.6745σ.)


below, being the smaller of the two, is found in column C. The situation is
just the reverse of the case where the division comes at +1σ. It is necessary to
keep in mind in this kind of problem whether the area we wish to know is 
under the smaller end of the curve, all on one side of the mean, or whether it is 
under the larger side of the curve extending across the mean. 


Fig. 7.7. Proportions of the area under the normal curve above and below the standard
score of 1σ.


The proportion of the area above the point at +0.78σ is in the smaller
portion and, found in column C, it is .2177. The area below −1.47σ is also under
the smaller portion of the curve and, from column C, we find that it is .0708
(see Fig. 7.5). The area above the point −1.47σ would be equal to 1.0 −
.0708, which is .9292. Or it can be found from column B, since it occupies the
larger portion under the curve, and this also gives us .9292. Or, from Fig. 7.5,
we can see that it is the sum of the area from the point to the mean (.4292) 
plus .500, which gives the same result. 

In the memory-test data, where the mean is 26.1 and σ is 6.45, we may ask
for the percentage of the cases to be expected below a score of 15. The
deviation from the mean is −11.1. When this is divided by 6.45, we find that
the z score is −1.72. Corresponding to a z of −1.72 is an area of .0427 in
the tail of the normal curve (see Fig. 7.6). We may expect 4.27 per cent of
the cases below a score of 15, or, out of 86, this would be 3.7 cases. Above a
score of 15, we should expect the remainder of the cases, naturally, i.e., a
proportion of .9573, a percentage of 95.73, and in number of cases, 82.3. Above
a score of 30.45, which corresponds to a z score of +0.6745, we should expect
25 per cent of the cases. 
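
The areas above and below a point can likewise be computed directly. In the Python sketch below, `statistics.NormalDist` stands in for columns B and C of Table B; the function name is ours, and small differences from the tabled values arise because the table is entered with z rounded to two decimals.

```python
from statistics import NormalDist

def proportions_below_and_above(score, mean, sigma):
    """Return (proportion below, proportion above) a raw-score point."""
    below = NormalDist(mean, sigma).cdf(score)
    return below, 1.0 - below

below, above = proportions_below_and_above(15, 26.1, 6.45)
print(f"below 15: {below:.4f} ({86 * below:.1f} cases)")
print(f"above 15: {above:.4f} ({86 * above:.1f} cases)")
```

Whichever of the two returned values is the smaller corresponds to the tail of the curve, and the other to the portion that extends across the mean.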

Area between Two Points on the Scale. The first case of this kind of 
problem has already been mentioned when we asked for the proportion of the 
area between −1σ and +1σ and the like. When the two score points are on
two sides of the mean, it is simply a matter of summing the two areas between
the mean and the two points. For example, between the points −1.47σ and
+0.78σ, we have the two areas .4292 and .2823 to add (see Fig. 7.5). The
result is .7115, or 71.15 per cent. 

When the two points lie on the same side of the mean, it is a matter of sub- 
tracting the smaller area from the larger, more inclusive area. For example, 
the area between points at +1σ and +2σ can be found by first obtaining from
the table the area from the mean to +1σ (which is .3413) and the area from
the mean to +2σ (which is .4772). The area we seek is .4772 − .3413 = .1359
(see Fig. 7.4). The area between points −2σ and −3σ would be the area
.4987 (from Table B, column A) minus .4772 (from the same source). The 
difference is equal to .0215, which is illustrated in Fig. 7.4. 

The area between two raw-score points again involves the determination of 
z scores as the first step. In the memory-test data, between scores 10 and 20,
which correspond to z scores of −2.50 and −0.945, respectively, the area is
the difference between .4938 and .3276, which is .1662, or 16.62 per cent. 
The areas from the mean to the two z scores are found as usual in Table B. 
As one more example from the same data, the proportion of the cases between 
scores of 30 and 35 is equal to .1888, for the z scores are +0.605 and +1.38,
respectively, and the areas to the mean in the two cases are .2274 and .4162. The
student should verify these estimates. 
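
Both rules, adding when the points lie on opposite sides of the mean and subtracting when they lie on the same side, reduce to a single difference of cumulative proportions. The following Python sketch shows this; the function name is ours, and `statistics.NormalDist` again takes the place of Table B.

```python
from statistics import NormalDist

def proportion_between(low, high, mean=0.0, sigma=1.0):
    """Proportion of the area under the normal curve between two points."""
    dist = NormalDist(mean, sigma)
    return dist.cdf(high) - dist.cdf(low)

# Standard-score limits on opposite sides of the mean.
print(round(proportion_between(-1.47, 0.78), 4))
# Raw-score limits in the memory-test data (M = 26.1, sigma = 6.45).
print(round(proportion_between(10, 20, 26.1, 6.45), 4))
print(round(proportion_between(30, 35, 26.1, 6.45), 4))
```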

Points above or below Which Certain Proportions of the Cases Fall. The 
next problems reverse the processes that have just been described. Before, 
we were given points on the scale of measurement to determine areas; now we 
are given areas from which to determine points on the scale. For example, 
above what point in the normal curve does the highest 10 per cent of the cases
come? Ten per cent is a proportion of .10. We could now use Table B in 
reverse, but it is much more convenient to utilize Table C, which gives the 
proportions in even steps. We are faced with a problem that gives the pro- 
portion in the tail of the curve, and so we look in the last column for C, the 
smaller area. We find the z score corresponding to it to be 1.2816. This will 
be with plus sign, since we are talking about the highest 10 per cent (see 
Fig. 7.8). Had we asked below what point does the lowest 10 per cent fall, 
the answer would have been −1.2816σ. If the question is, “Above what
score lies the highest 80 per cent of the cases?” we are then dealing with the
larger proportion under the curve; accordingly we look for the proportion of
.80 in the first column of Table C. The corresponding z score is −0.8416σ
(see Fig. 7.8). Had we asked for the point below which is the lowest 80 per
cent, the answer would have been +0.8416σ.

To apply these same questions to the memory-test data, we need go a step 
further and transform the z scores into terms of the raw-score scale. The 
highest 10 per cent come above a z of +1.2816. Multiplying this by σ (which
is 6.45), we obtain the deviation (x) of +8.27. The mean (or 26.1) plus 8.27
gives us a score of 34.37 points. The highest 10 per cent in a normal curve
with mean of 26.1 and σ of 6.45 would come above the point 34.37. It happens
that this point comes close to the division point between two class intervals,
or 34.5. In the actual distribution (see Table 7.1), 10 cases, or close to


Fig. 7.8. Score points above or below which certain percentages of the cases are expected
in the memory-test distribution, assuming normality of distribution.


12 per cent, were scores of 35 or above, which is good agreement. Ten per 
cent would have called for 8.6 cases, or 9 in whole numbers. 

The highest 80 per cent of the cases, which we found to come above a z score 
of −0.8416σ, will be expected above a raw score of what? The deviation of
this point from the mean is −5.43 points, or a score of 20.67. This comes
close to another division point between class intervals, namely, 20.5. In the
actual distribution, 71, or 82.5 per cent, of the cases are above a score of 20.5. 
Again the agreement between obtained proportion and expected proportion 
is quite close. To take one more case, which gives a point exactly between 
class intervals, we ask above what point are 93.2 per cent of the cases? The 
point turns out to be a score of 16.5 points (the student should verify this).
The actual percentage of cases above this score point is 92, again a very close
agreement. 
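
The reverse problem, going from a proportion to a point on the scale, can be sketched in Python with the inverse of the cumulative normal distribution. `NormalDist.inv_cdf` from the standard library is used here in place of Table C; the function name is ours.

```python
from statistics import NormalDist

def point_above_which(proportion, mean, sigma):
    """Return (z, raw score) above which the given proportion of cases falls."""
    # The proportion above a point is 1 minus the proportion below it.
    z = NormalDist().inv_cdf(1.0 - proportion)
    return z, mean + z * sigma

for p in (0.10, 0.80, 0.932):
    z, score = point_above_which(p, 26.1, 6.45)
    print(f"highest {100 * p:.1f} per cent: z = {z:+.4f}, score = {score:.2f}")
```

The sign of z comes out automatically: positive when the proportion asked for is the smaller portion in the upper tail, negative when it is the larger portion extending across the mean.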

Centiles and Corresponding z Scores. By now it may be apparent that 
we can look up in the tables the z score corresponding to any given centile. 
For example, P90 is the point below which are 90 per cent of the cases. Entering
Table C with .90 in column B, we find the corresponding z to be +1.2816.
Corresponding to P80 is the z score of +0.8416. We could find the raw-score
points corresponding to all these z scores for any particular distribution. If
the assumption of normal distribution is valid, this procedure would be an
advance step over the recommendation of smoothed ogives for setting up 
centile norms. But if there is any noticeable skewing in the distribution, this 
procedure would be rather questionable. The smoothed-ogive method would 
leave the actual skewness taken into account. Since further measurements 
with the same test will probably yield the same kind of distribution from the 
same population, this deviation from normality should be represented in the 
norms. 
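
A set of normalized centile points of the kind just described could be tabulated as in the Python sketch below. The function name and the particular centiles chosen are ours, and the sketch presupposes that the assumption of normal distribution is acceptable; where there is noticeable skewing, the caution given above applies.

```python
from statistics import NormalDist

def centile_norms(centiles, mean, sigma):
    """Rows of (centile, z, raw score) under an assumed normal distribution."""
    unit = NormalDist()
    rows = []
    for c in centiles:
        z = unit.inv_cdf(c / 100.0)        # z below which c per cent fall
        rows.append((c, z, mean + z * sigma))
    return rows

for c, z, score in centile_norms([10, 25, 50, 75, 90], 26.1, 6.45):
    print(f"P{c}: z = {z:+.4f}, score = {score:.2f}")
```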

It can now be explained how, earlier (see Table 6.5), we arrived at the
spacing of centile scores on the profile chart (Fig. 6.5). The values given to
represent the spacing of the centiles are the z scores corresponding to them, 
and they were obtained as was explained in the preceding paragraph. The 
result is to normalize the distribution of all tests, whether the original measur- 
ing scale gave a normal distribution or not. There is, in other words, a 
general underlying assumption of normal distribution of the population in all 
the abilities represented in the profile chart. The most important gain in so 
doing is to transform measurements of all abilities into the terms of a common 
intelligible scale. 

The Points between Which Lie Certain Proportions of the Middle Cases. 
Among the problems involving area under the curve, there remains the case 
in which, given the area of a central group, what are the score limits of that 
group? The only practical case here occurs when the central group is evenly 
balanced on either side of the mean: the middle 50 per cent, 80 per cent, or 90 
per cent. Those groups, it will be remembered, are significant in connection 
with indicators of variability and are given distinction in the graphic device 
illustrated in Fig. 6.7. Here, however, we are talking about the best-fitting 
normal curve and not the original distribution. The middle 50 per cent 
extends from Q1 to Q3, or from P25 to P75. Going to the tables with a
proportion of .75, we find the corresponding z to be, as we should expect, 0.6745σ.
The two points bounding this middle 50 per cent are −0.6745σ and +0.6745σ.
In the distribution of memory-test scores, these points would correspond to
actual scores of 21.75 and 30.45. The interpolated Q1 and Q3 in this same
obtained distribution were 21.00 and 30.85, respectively, or not very far from
those estimated in the best-fitting curve. The middle 80 per cent extends
from P10 to P90. We have previously determined these to be at a distance of
1.2816σ, minus and plus. The corresponding raw scores are 17.83 and 34.37.
The interpolated 10th and 90th centiles are 17.1 and 35.3, again in close
agreement. This kind of problem has really little application in psychological and
educational statistics but is included for the sake of completeness and with 
the hope that it may lend further insight into the several ramifications of the 
normal distribution curve. All other problems having to do with area
illustrated above do have numerous and valuable applications, some of which we
shall meet in Chap. 19.
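
For completeness, the limits of a central group balanced about the mean can be found as in this Python sketch. The function name is ours, and `NormalDist.inv_cdf` is used in place of Table C.

```python
from statistics import NormalDist

def middle_limits(proportion, mean, sigma):
    """Score limits enclosing the middle `proportion` of a normal distribution."""
    # Half of the excluded area lies in each tail, so the upper limit is the
    # point below which fall 0.5 + proportion/2 of the cases.
    z = NormalDist().inv_cdf(0.5 + proportion / 2.0)
    return mean - z * sigma, mean + z * sigma

for p in (0.50, 0.80, 0.90):
    low, high = middle_limits(p, 26.1, 6.45)
    print(f"middle {100 * p:.0f} per cent: {low:.2f} to {high:.2f}")
```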
Exercises.


1. a. Toss six pennies 64 times. After each throw, note and record the number of heads.
Compare your obtained frequencies with the expected frequencies. Plot frequency 
polygons of the two distributions. Compute the mean and standard deviation of 
the distribution. 

b. Toss the same six pennies 64 times more, obtaining a new set of data like the first.
Compute the mean and standard deviation of this distribution, and make compari- 
sons with the first obtained distribution and with the theoretical distribution. 

c. Combine the two distributions into a single one. Are the frequencies now any 
nearer the expected ones? Compute the mean and standard deviation. Are 
they any nearer the mean and standard deviation of the theoretical distribution? 

d. One more experiment may be tried in which some of the outcomes with a small 
number of heads are not counted, but another throw is immediately substituted. 
Every second case in which at a glance you can tell the number of heads is small 
should be ignored and the trial repeated. Again, obtain 64 recorded trials. This
situation illustrates a biased sampling. What is the effect upon the frequencies? 

e. What would happen in another set of trials if one penny were left head up, only 
the remaining five being thrown each time but all six coins being observed and all 
heads being counted? 

2. Determine the standard scores for all the midpoints in the distribution of Data 7A. 

Also determine the z scores for the following raw scores: 40 55 72 85 95. 


Data 7A. DISTRIBUTION OF SPELLING-TEST SCORES IN A SUPERIOR GROUP
OF FRESHMEN*


Scores f 
82-85 1 
78-81 8 
74-77 8
70-73 5 
66-69 34 
62-65 21 
58-61 39 
54-57 32 
50-53 20 
46-49 7 


* The test was one of the Cooperative series, and the scores are T scores (see Chap. 19). 


3. From Table B, determine the ordinate value at each midpoint of distribution 7A.

4. Find the expected frequency for each class interval, and tabulate them and the 
observed frequencies in parallel columns. State some inferences that you can draw from 
your results.

5. Find the best-fitting normal curve for Data 7A after the manner of Table 7.2. Plot
the curve along with the obtained frequencies.

6. Find the proportions of the areas under the normal curve between the mean and the 
following z scores: −2.15 −1.85 −0.19 +0.375 +1.1 +3.52.

7. Find the proportions and numbers of cases to be expected between the mean and the 
following scores in Data 7A: 35 45 60 65 75 58.35.

8. Find the proportions of the area above the following z scores: +2.15 +1.62
+0.175 −0.36 −1.9 −2.8; also below the following z scores: −3.80 −1.225
−0.6745 +0.05 +1.75 +2.3.

9. Find the proportions and numbers of cases to be expected in distribution 7A above
the following score points: 80 55 65 69.5 54.5 41.5; also below the
following score points: 85 45 56 77.5 51.5 61.5. Whenever possible, 
compare expected with obtained frequencies. 

10. Find the proportions of the area falling between z scores: — 1.50 and +1.25 —0.05 
and +2.70 +0.55 and +0.95 —2.70 and —1.15 +1.15 and +2.90 +1.25 
and —0.35. 

11. Find the proportions and numbers of cases to be expected in distribution 74 between 
the score points: 70 and 80 38 and 45 45and65 69.8 and 61.5 45.5 and 53.5 
57.5 and 65.5. Whenever possible, compare expected with obtained frequencies. 

12. Give in terms of standard measurements the points above which the following per- 
centages of the cases fall in the normal distribution: 85 55 35 42.3 66.7 
9.4. 

13. Give the s score below which the following proportions of the casesfall:.14 ^ .62 
375 418 .129. 

14. Above what scores in distribution 7A will the following percentages of the cases be 
expected: 12, 54, $4.13, 5.75, and 68.4 per cent? 

15. Below what scores in distribution 7A should we expect the following number of 
cases: 11 63 89.5 123 1627 Compare expected with actual cumulative 


frequencies. 
16. What z scores correspond to the following centile ranks: 75 62.5 16.7 5 


99? 
17. Between what score limits in distribution 74 should we expect the middle 80 per cent 


of the cases? The middle 50 per cent? The middle 90 per cent? Compare these with 
the interpolated limits for these same percentages. 


Answers 
2. z at midpoints: +2.67; +2.19; +1.71; +1.24; +0.76; +0.29; -0.19; -0.67; -1.14;
   -1.62; -2.10; -2.57; -3.05.
   Selected z scores: -2.51; -0.73; +1.30; +2.84; +4.04.
3. Ordinates (y): (.003); .011; .036; .092; .185; .298; .383; .392; .319; .208; .108; .044;
   .015; .004.
4. fe: (0.2); 1.0; 3.1; 7.8; 15.8; 25.4; 32.6; 33.4; 27.2; 17.7; 9.2; 3.8; 1.2; 0.3.
5. f: 0.4; 1.5; 4.6; 11.0; 20.6; 30.0; 34.0; 30.0; 20.6; 11.0; 4.6; 1.5; 0.4.
6. p: .4842; .4678; .0753; .1461; .3643; .4998.
7. p: .4990; .4716; .0521; .1787; .4510; .1282.
   f: 89.3; 84.4; 9.3; 32.0; 80.7; 22.9.
8. p above: .0158; .0527; .4306; .6405; .9713; .9974.
   p below: .00007; .1104; .2500; .5199; .9599; .9893.
9. p above: .0122; .7660; .3214; .1587; .7840; .9902.
   f above: 2.2; 137.1; 57.5; 28.4; 140.3; 177.2.
   p below: .9977; .0276; .2720; .9745; .0098; .5191.
   f below: 178.6; 4.9; 48.7; 174.4; 1.8; 92.9.
10. p: .8276; .5164; .1201; .1216; .1232; .5312.
11. p: .1325; .0274; .6503; .3222; .1511; .3658.
    f: 28.7; 4.9; 116.4; 57.7; 27.0; 65.5.
12. z: -1.0364; -0.1257; +0.3853; +0.1942; -0.4316; +1.3094.
13. z: -1.0803; +0.3055; -0.3186; -0.2070; +0.6098.
14. z: +1.1750; -0.1004; -1.0000; +1.5765; -0.4789.
    X: 71.0; 60.3; 52.7; 74.3; 57.1.
15. X: 48.1; 57.9; 61.1; 65.2; 72.0.
    fe: 11; 63; 89.5; 123; 162.
    fo: 9; 67; 98; 121; 160.
16. z: +0.6745; +0.3186; -0.9661; -1.6449; +2.3268.
17. Expected limits: 50.3 and 71.9; 55.4 and 66.8; 47.3 and 74.9.
    Interpolated: 49.4 and 73.0; 55.2 and 66.8; 48.3 and 77.5.
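
Answers of the kind given for Exercises 6 and 8 can be checked without a printed table. The following is a minimal sketch in Python, assuming the unit normal curve and using the error function from the standard library; the function names are hypothetical and are not taken from the text. Small differences in the fourth decimal place from tabled values are to be expected where the table was read by interpolation.

```python
import math


def normal_cdf(z: float) -> float:
    """Proportion of the area under the unit normal curve lying below z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_mean_to_z(z: float) -> float:
    """Proportion of the area between the mean (z = 0) and the given z."""
    return abs(normal_cdf(z) - 0.5)


def area_above(z: float) -> float:
    """Proportion of the area above the given z."""
    return 1.0 - normal_cdf(z)


# z scores of Exercise 6: areas between the mean and each z.
for z in (-2.15, -1.85, -0.19, 0.375, 1.1, 3.52):
    print(f"z = {z:+.3f}   area from mean = {area_mean_to_z(z):.4f}")
```

Multiplying any such proportion by N gives the corresponding expected number of cases.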
CHAPTER 8


CORRELATION 


No single statistical procedure has opened up so many new avenues of dis- 
covery in psychology and education as that of correlation. This is under- 
standable when we remember that scientific progress depends upon finding 
out what things are co-related and what things are not. A coefficient of corre- 
lation is a single number that tells us to what extent two things are related, to 
what extent variations in the one go with variations in the other. Without
the knowledge of how one thing varies with another, we should find predictions
impossible. And wherever causal relationships are involved, without
knowledge of covariation, we should be unable to control one thing by 
manipulating another. 

For example, when we know that the higher a girl's score in a clerical- 
aptitude test, the higher the average performance she is likely to exhibit after 
training, we can thereafter use scores on this test to predict level of proficiency.
We say that there is a high positive correlation between aptitude-test
score and clerical success. We discover this fact by finding a coefficient
of correlation between scores of a number of girls and measures of clerical 
performance later for the same girls. We can never compute a coefficient of 
correlation on one person alone, nor can we compute it without having made 
two sets of measurements on the same individuals, or on matched pairs of 
individuals. In this instance, if we consider that the aptitude test has meas- 
ured individual differences in some quality or qualities that lead to success, 
i.e., in the sense of a cause of clerical success, then we can not only predict 
future success for individuals but also promote high general efficiency in any 
group of clerks by selecting those with high scores. Thus are studies leading 
to prediction and control of human affairs promoted because correlation 
techniques are available. Without some device like this for checking up on a 
test, we have only vague notions concerning its effectiveness, unless, indeed, 
its effectiveness is so obvious to direct observations as to require no inspection 
by correlation methods, which is highly unlikely. 


THE MEANING OF CORRELATION 


Some Examples of Correlation between Two Variables. The coefficient of 
correlation is one of those summarizing numbers, like a mean or a standard
deviation, which, though it is a single number, tells a story. It can vary from
a value of +1.00, which means perfect positive correlation, through zero,
which means complete independence or no correlation whatever, on down to 
—1.00, which means perfect negative correlation. 

A Case of Perfect, Positive Correlation. Figure 8.1 illustrates an instance of 
perfect positive correlation. It is a fictitious case, for such exact agreement 
between two things is rarely or never experienced, certainly not in psychology 
or education. Here we have assumed two tests, X and Y. Ten individuals
have received scores in the two tests. The pairs of scores are as follows: 


Looking down the rows of scores, each pair made by one individual, we readily 
conclude that each person's score in Y is two points higher than his score in X. 


[Fig. 8.1. A simple correlation chart illustrating the kind of relationship between X and Y scores when the correlation is +1.00. Horizontal axis: Test X.]
[Fig. 8.2. A correlation chart illustrating the kind of situation when the correlation is +.76. Horizontal axis: Test X.]


In terms of a simple equation, Y = X + 2. There are no exceptions, which
makes the correlation perfect. 
To take another instance: 


Individual
Score in test P    12  15
Score in test Q

In this situation, each person's score in Q is two times that in P, again without
exception; there is perfect correlation, and the coefficient of correlation would
be +1.00. The equation for predicting Q from P is Q = 2P.
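
Both kinds of exact rule, adding a constant and multiplying by a constant, lead to the same coefficient. The sketch below illustrates this in Python with hypothetical scores for ten individuals (they are assumptions for illustration, not the scores of the text), using the product-moment computation that is developed later in this chapter. The function name is likewise an assumption.

```python
import math


def pearson_r(xs, ys):
    """Product-moment r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Hypothetical scores for ten individuals.
test_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_y = [x + 2 for x in test_x]           # Y = X + 2
test_p = [3, 5, 6, 8, 9, 11, 12, 14, 15, 17]
test_q = [2 * p for p in test_p]           # Q = 2P

r_additive = pearson_r(test_x, test_y)
r_multiplicative = pearson_r(test_p, test_q)
print(round(r_additive, 2), round(r_multiplicative, 2))
```

In both cases the points fall on a straight line rising from left to right, and the coefficient should come out at +1.00 apart from rounding in the machine arithmetic.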
A Case of High Positive Correlation. In Fig. 8.2, we have illustrated a case
of correlation that is positive but less than +1.00. The graphic picture of
the individuals shows that, in general, a person who is high in test X is also
high in test Y, and one who is low in X is also likely to be low in Y. The
actual scores for these 10 people are listed in the first two columns of Table 8.1. 
It will be seen that although the individuals are arranged in rank order for 
scores in X, there are some deviations from this rank order when we inspect 
their scores in Y. The coefficient of correlation by computation is equal to 
+.76. We shall soon see how this was obtained, but first simply note by
comparison of Figs. 8.1 and 8.2 how the individuals are scattered in the diagrams.
In Fig. 8.1, they line up in perfect file from lowest to highest. In
Fig. 8.2, they tend to fan out or to diverge from a strict line-up, but a definite
trend of relationship can be observed. The amount of spreading in Fig. 8.2
as compared with that in Fig. 8.1 (in which it is, of course, none) illustrates
the difference between correlations of +1.00 and +.76.

[Fig. 8.3. An example of a correlation chart when the correlation is only +.14.]
[Fig. 8.4. An example of a correlation chart when the correlation is -.69.]

A Case of Low Positive Correlation. A third instance is shown in Fig. 8.3, 
in which the spreading effect to which our attention was called before is even 
greater. The coefficient of correlation here is +.14, in other words, close to
zero. This being true, a person with high score in X is likely to be almost 
anywhere, within the total range, in terms of his Y score. The three highest 
people in X, with scores of 10, 12, and 13, scatter all the way from 3 to 11 in 
test Y. The three lowest people in test X, with scores of 1, 3, and 4, scatter 
all the way from 2 to 9 in test Y. Although there is a trace of relationship
between X scores and Y scores, it is very weak. The actual scores may be 
compared in Table 8.3. 

A Case of High Negative Correlation. The situation that obtains when 
there is a negative correlation is shown in Fig. 8.4. Here the coefficient is 
—.69. Compare this diagram with that in Fig. 8.2, and it will be apparent 
that the trend of the points is along the other diagonal now, from upper left
to lower right. This illustrates the fact that persons making high scores in X
are likely to make low scores in Y, and persons making low scores in X are
likely to make high scores in Y. This inverse order of relationship is also
apparent in the actual scores in the first two columns of Table 8.2. The 
numerical size of the coefficient (.69) is nearly the same as for the correlation 
in Fig. 8.2 (.76). It will be seen that the width of scatter of the points is 
about the same in the two cases. A perfect negative correlation would be 
pictured as a line of dots like that in Fig. 8.1 but it would slant downward 
instead of upward from left to right. The algebraic sign of the coefficient of 
correlation therefore merely has to do with the direction of the relationship 
between two things, whether direct or inverse, and the size of the coefficient 
(distance from zero) has to do with the strength, or closeness, of the relationship. 


HOW TO COMPUTE A COEFFICIENT OF CORRELATION


The Product-moment Coefficient of Correlation. The standard kind of 
coefficient of correlation and the one most commonly computed is Pearson's 
product-moment coefficient. The basic formula is 


$$ r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y} $$  (Basic formula for a Pearson product-moment coefficient of correlation)  (8.1)

where $r_{xy}$ = correlation between X and Y
x = deviation of any X score from the mean in test X
y = deviation of the corresponding Y score from the mean in test Y
$\sum xy$ = sum of all the products of deviations, each x deviation times its
corresponding y deviation
$\sigma_x$ and $\sigma_y$ = standard deviations of the distributions of X and Y scores

The steps necessary are illustrated in Table 8.1. They will be enumerated
here:


Step 1. List in parallel columns the paired X and Y scores, making sure that 
corresponding scores are together. 

Step 2. Determine the two means $M_x$ and $M_y$. In Table 8.1, these are 7.5
and 8.0, respectively.

Step 3. Determine for every pair of scores the two deviations x and y. Check
them by finding algebraic sums, which should be zero.

Step 4. Square all the deviations, and list in two columns. This is for the
purpose of computing $\sigma_x$ and $\sigma_y$.

Step 5. Sum the squares of the deviations to obtain $\sum x^2$ and $\sum y^2$.

Step 6. From these values compute $\sigma_x$ and $\sigma_y$.

Step 7. For every person, find his xy product (last column of Table 8.1).
Sum these for $\sum xy$.
TABLE 8.1. CORRELATION BETWEEN TWO SETS OF MEASUREMENTS OF THE SAME
INDIVIDUALS; UNGROUPED DATA; PRODUCT-MOMENT COEFFICIENT OF CORRELATION

            X      Y       x      y       x²     y²       xy
           13     11    +5.5     +3    30.25      9    +16.5
           12     14    +4.5     +6    20.25     36    +27.0
           10     11    +2.5     +3     6.25      9    + 7.5
           10      7    +2.5     -1     6.25      1    - 2.5
            8      9    +0.5     +1     0.25      1    + 0.5
            6     11    -1.5     +3     2.25      9    - 4.5
            6      3    -1.5     -5     2.25     25    + 7.5
            5      7    -2.5     -1     6.25      1    + 2.5
            3      6    -4.5     -2    20.25      4    + 9.0
            2      1    -5.5     -7    30.25     49    +38.5

Sums       75     80     0.0      0   124.50    144   +102.0
Means     7.5    8.0                    Σx²     Σy²      Σxy

$$ \sigma_x = \sqrt{\frac{124.5}{10}} = \sqrt{12.450} = 3.528 $$

$$ \sigma_y = \sqrt{\frac{144}{10}} = \sqrt{14.4} = 3.795 $$

$$ r_{xy} = \frac{\sum xy}{N \sigma_x \sigma_y} = \frac{102.0}{(10)(3.53)(3.79)} = \frac{102.0}{133.79} = +.76 $$

An alternative solution without computing the σ's:

$$ r_{xy} = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} = \frac{102.0}{\sqrt{(124.5)(144)}} = \frac{102.0}{\sqrt{17{,}928.0}} = \frac{102.0}{133.90} = +.76 $$


Step 8. We are now ready for formula (8.1). In the illustrative problem, 
the arithmetic is given following Table 8.1. 
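
The eight steps can be sketched in Python as follows. This is a minimal illustration, using the ten pairs of scores of Table 8.1; the variable names are assumptions chosen for readability, and the standard deviations are computed with N (not N - 1) in the denominator, as formula (8.1) requires.

```python
import math

# Step 1. Paired X and Y scores from Table 8.1.
x_scores = [13, 12, 10, 10, 8, 6, 6, 5, 3, 2]
y_scores = [11, 14, 11, 7, 9, 11, 3, 7, 6, 1]
n = len(x_scores)

# Step 2. The two means.
mean_x = sum(x_scores) / n
mean_y = sum(y_scores) / n

# Step 3. Deviations from the means; each set should sum to zero.
x_dev = [x - mean_x for x in x_scores]
y_dev = [y - mean_y for y in y_scores]

# Steps 4 and 5. Squared deviations and their sums.
sum_x2 = sum(d * d for d in x_dev)
sum_y2 = sum(d * d for d in y_dev)

# Step 6. Standard deviations.
sigma_x = math.sqrt(sum_x2 / n)
sigma_y = math.sqrt(sum_y2 / n)

# Step 7. Sum of the cross products of deviations.
sum_xy = sum(dx * dy for dx, dy in zip(x_dev, y_dev))

# Step 8. Formula (8.1).
r_xy = sum_xy / (n * sigma_x * sigma_y)
print(f"r = {r_xy:+.2f}")
```

Because the unrounded standard deviations are carried through, the machine result may differ slightly in later decimal places from a hand computation that rounds at each step, but it agrees with Table 8.1 to two places.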


A Shorter Solution. There is an alternative and shorter route that omits
the computation of $\sigma_x$ and $\sigma_y$, should they not be needed for any other purpose.
The formula is

$$ r_{xy} = \frac{\sum xy}{\sqrt{(\sum x^2)(\sum y^2)}} $$  (Alternative formula for a Pearson r)  (8.2)

The solution with this formula is also given with Table 8.1, and it leads to the
same coefficient. In both cases, two significant digits have been saved in r,
for the reason that for so small a number of cases the sampling error in r is so
relatively large that more than two digits would be rather deceiving as to
accuracy. When N is large (200 or more), three-place accuracy in r may
more properly be reported.

Computing a Negative Coefficient. As another example of the computation
of r, when the correlation is negative, Table 8.2 is presented. The operations
are just the same, step by step. The only thing new is the care that must be
taken with algebraic signs.


TABLE 8.2. A NEGATIVE CORRELATION IN UNGROUPED DATA BY THE
PRODUCT-MOMENT METHOD

            X      Y      x       y     x²      y²       xy
           12      7     +5    -1.5     25    2.25    - 7.5
           10      3     +3    -5.5      9   30.25    -16.5
            9      8     +2    -0.5      4     .25    - 1.0
            8      5     +1    -3.5      1   12.25    - 3.5
            7      7      0    -1.5      0    2.25      0.0
            7     12      0    +3.5      0   12.25      0.0
            6     10     -1    +1.5      1    2.25    - 1.5
            5      9     -2    +0.5      4     .25    - 1.0
            4     13     -3    +4.5      9   20.25    -13.5
            2     11     -5    +2.5     25    6.25    -12.5

Sums       70     85      0     0.0     78   88.50    -57.0
Means     7.0    8.5                   Σx²     Σy²      Σxy

$$ \sigma_x = \sqrt{\frac{78}{10}} = \sqrt{7.8} = 2.79 $$

$$ \sigma_y = \sqrt{\frac{88.5}{10}} = \sqrt{8.85} = 2.97 $$

$$ r_{xy} = \frac{-57.0}{(10)(2.79)(2.97)} = \frac{-57.0}{82.863} = -.69 $$
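
The shorter route of formula (8.2) can be checked on the same data. The sketch below is a minimal illustration in Python using the scores of Table 8.2; the function name is an assumption. The sign of the result comes entirely from the sum of cross products, since the denominator is a positive square root.

```python
import math


def pearson_r_short(xs, ys):
    """Formula (8.2): r from sums of deviation products, without the sigmas."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    x_dev = [x - mean_x for x in xs]
    y_dev = [y - mean_y for y in ys]
    sum_xy = sum(dx * dy for dx, dy in zip(x_dev, y_dev))
    sum_x2 = sum(dx * dx for dx in x_dev)
    sum_y2 = sum(dy * dy for dy in y_dev)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Scores from Table 8.2.
x_scores = [12, 10, 9, 8, 7, 7, 6, 5, 4, 2]
y_scores = [7, 3, 8, 5, 7, 12, 10, 9, 13, 11]

r_negative = pearson_r_short(x_scores, y_scores)
print(f"r = {r_negative:+.2f}")
```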


Computing r from Original Measurements. In both examples thus far, we
have been dealing with a small number of observations and ungrouped data.
When the data are more numerous, we resort to grouping into class intervals;
but first let us see another procedure with ungrouped data, which does not
require the use of deviations. It deals entirely with original scores. When
raw scores are small numbers or when a good calculating machine is available,
this is the best procedure. The formula may look forbidding but is really
easy to apply:

$$ r_{xy} = \frac{N \sum XY - (\sum X)(\sum Y)}{\sqrt{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]}} $$  (8.3)


where X and Y are original scores in variables X and Y. Other symbols tell 
what is done with them. We follow the steps that are illustrated in Table 8.3. 


Step 1. Square all X and Y measurements. 

Step 2. Find the XY product for every pair of scores.

Step 3. Sum the X's, the Y's, the X²'s, the Y²'s, and the XY's.
Step 4. Apply formula (8.3). 
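
The four steps above can be sketched in Python as follows. For want of the original scores of Table 8.3, the sketch applies the raw-score formula to the scores of Table 8.1, an assumption made only so that the result can be compared with the +.76 obtained earlier by the deviation method. The function name is hypothetical.

```python
import math


def pearson_r_raw(xs, ys):
    """Formula (8.3): r computed entirely from original scores."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)               # Step 1: squares of X
    sum_y2 = sum(y * y for y in ys)               # Step 1: squares of Y
    sum_xy = sum(x * y for x, y in zip(xs, ys))   # Step 2: XY products
    numerator = n * sum_xy - sum_x * sum_y        # Steps 3 and 4
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


# Scores of Table 8.1, used here only for comparison with the earlier result.
x_scores = [13, 12, 10, 10, 8, 6, 6, 5, 3, 2]
y_scores = [11, 14, 11, 7, 9, 11, 3, 7, 6, 1]

r_raw = pearson_r_raw(x_scores, y_scores)
print(f"r = {r_raw:+.2f}")
```

With whole-number scores every sum is exact until the final square root and division, which is one reason the method suits machine work.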
The author has found it more convenient, particularly when machine work
can be done, to compute $r_{xy}^2$ first by the formula

$$ r_{xy}^2 = \frac{[N \sum XY - (\sum X)(\sum Y)]^2}{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]} $$  (8.4)

and then finally extract the square root to find $r_{xy}$, as shown just below
Table 8.3.

TABLE 8.3. CORRELATION OF UNGROUPED DATA COMPUTED FROM THE 
ORIGINAL MEASUREMENTS 


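$$
\begin{aligned}
r_{xy}^2 &= \frac{[N \sum XY - (\sum X)(\sum Y)]^2}{[N \sum X^2 - (\sum X)^2][N \sum Y^2 - (\sum Y)^2]} \\
&= \frac{(4{,}720 - 4{,}550)^2}{(6{,}240 - 4{,}900)(5{,}330 - 4{,}225)} \\
&= \frac{(170)^2}{(1{,}340)(1{,}105)} \\
&= \frac{28{,}900}{1{,}480{,}700} \\
&= .019518 \\
r_{xy} &= \sqrt{.019518} \\
&= +.14
\end{aligned}
$$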

Preparing a Scatter Diagram. When N is large, even when N is moderate 
in size, and when no calculating machine is available, the customary pro- 
cedure is to group data in both X and Y and to form a scatter diagram or 
correlation diagram. The choice of size of class interval and limits of inter- 
vals follows much the same rules as were given in Chap. 3. For the sake of a 
clearer illustration of the procedure, a smaller number of classes will be 
employed in the problem now to be described. The data were scores earned 
by a class in educational measurements in two objectively scored examinations,
one of which stressed statistical methods and the other of which stressed
tests and measurements. 

In setting up a double grouping of data, a table is prepared with columns 
and rows—columns for the dispersions of Y scores within each class interval 
for the X scale, and rows for the dispersions of X scores within each class 
interval for the Y scale. Along the top of the table (see Table 8.4) are listed
the score limits for the class intervals in test X. Along the left-hand margin
are listed the score limits for the class intervals in test Y. We make one tally
mark for each individual's X and Y scores. For example, if one individual
had a score of 83 in test X and a score of 121 in test Y, we place a tally mark
for him in the cell of the diagram at the intersection of the column for interval
80-84 in X and the row for interval 120-124 in Y. All other individuals are
similarly located in their proper cells. 
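
The tallying operation can be sketched in Python as follows. This is a minimal illustration under stated assumptions: scores are whole numbers, class intervals are 5 points wide and begin on multiples of 5 (as in the intervals 80-84 and 120-124 of the example), and all pairs of scores except the first are hypothetical. The function names are not from the text.

```python
from collections import Counter


def class_interval(score: int, width: int = 5):
    """Return the (lower, upper) score limits of the interval holding score."""
    lower = (score // width) * width
    return (lower, lower + width - 1)


def tally(pairs, width: int = 5):
    """Count one tally per individual in the cell for his X and Y intervals."""
    cells = Counter()
    for x, y in pairs:
        cells[(class_interval(x, width), class_interval(y, width))] += 1
    return cells


def marginal_totals(cells):
    """Sum cell frequencies by column (X interval) and by row (Y interval)."""
    columns, rows = Counter(), Counter()
    for (x_interval, y_interval), f in cells.items():
        columns[x_interval] += f
        rows[y_interval] += f
    return columns, rows


# The first pair is the example in the text; the rest are hypothetical.
pairs = [(83, 121), (81, 124), (77, 118), (92, 127), (84, 116)]
cells = tally(pairs)
columns, rows = marginal_totals(cells)
print(cells[((80, 84), (120, 124))], dict(columns), dict(rows))
```

The column totals and the row totals must each add up to N, which gives a first check on the tallying.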


TABLE 8.4. A SCATTER DIAGRAM OF THE SCORES IN TWO ACHIEVEMENT TESTS
[Scatter diagram. Columns: X, scores in first achievement test. Rows: Y, scores in second achievement test.]


When the tallying is completed, we write the number of cases, or the cell 
frequency, in each of the cells. Next we sum the cell frequencies in the rows


when we have knowledge of the correct frequency distribution of Y or of X
from any other source. There are times when it is wise to do the entire tallying
two times and to compare all cell frequencies in the two attempts. It is
very easy to place a tally mark in the wrong cell. 
Computing the Pearson r from a Scatter Diagram. When the product-moment
r is computed from a scatter diagram, the formula becomes

$$ r_{xy} = \frac{\dfrac{\sum x'y'}{N} - M_{x'} M_{y'}}{\sigma_{x'} \sigma_{y'}} $$  (Pearson r from grouped and coded data)  (8.5)

where x' and y' = deviations of the coded values for X and Y from their
respective means
$M_{x'}$ and $M_{y'}$ = means of coded values x' and y', respectively
$\sigma_{x'}$ and $\sigma_{y'}$ = standard deviations of coded values x' and y', respectively

The correlation between X and Y is identical with that between the coded
values x' and y'; hence formula (8.5) gives us the correlation r without any
need for decoding. The details of application of this equation will now be
explained and illustrated.

Computing the Standard Deviations. From Table 8.5 we have all the 
necessary information for applying formula (8.5): 


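$$ M_{x'} = \frac{\sum fx'}{N} = \frac{20}{87} = .230 $$

$$ M_{y'} = \frac{\sum fy'}{N} = \frac{-30}{87} = -.345 $$

$$ \sigma_{x'} = \sqrt{\frac{\sum fx'^2}{N} - M_{x'}^2} = \sqrt{\frac{\sum fx'^2}{N} - .0529} = \sqrt{2.3149} = 1.52 $$

$$ \sigma_{y'} = \sqrt{\frac{\sum fy'^2}{N} - M_{y'}^2} = \sqrt{\frac{\sum fy'^2}{N} - .1190} = \sqrt{2.4557} = 1.57 $$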


Determining the Sum of the Cross Products. The new process to be mastered
here is the calculation of the cross products, or products of the moments, and
their sum, in other words, $\sum x'y'$. It is best to begin with the idea that every
cell has its own x'y' product and to keep that idea in mind. In fact, it is well
to determine the x'y' product for every cell in which individuals fall and to
write it in, as was done in Table 8.5.

The x'y' product for any cell is simply the product of the x' value times the
y' value of that cell, close watch being kept of algebraic signs. This matter is
easily checked, of course, by making sure that the sign of every x'y' product
is positive in the upper right quarter of the chart and also the lower left
quarter, but that they are all negative in the upper left and lower right quarters.
This rule presupposes that the X measurements are increasing from
left to right and that the Y measurements are increasing from below upward.

TABLE 8.5. SCATTER DIAGRAM FOR COMPUTING A PEARSON r
[Scatter diagram showing, for each cell, the x'y' product, the frequency, and the fx'y' product, with sums of positive and negative fx'y' values by rows and by columns.]

Having given every cell its x'y' value and having recorded it in the upper
left-hand corner of the cell, we next note how many individuals have that
x'y' value, in other words, the frequency in that cell. We multiply the cell
product by the frequency, and in Table 8.5 these products are recorded with
algebraic sign in the lower right-hand corners of the cells. All that remains
now is to summate them. We do this both in the columns and in the rows for
the sake of checking, for this is an unusually critical number in the correlation
formula, and because of the many steps involved in deriving it there are many
opportunities for errors. The last two columns in Table 8.5 are devoted to
the sums of fx'y' values in the rows. We keep the sums of the positive
products in one of these columns and the sums of the negative products in
the other. The last two rows of the table are reserved likewise for summing
the positive and negative sums in the columns. Summing everything in the
last two columns (also in the last two rows) of the table gives us $\sum x'y'$, and
the two estimates should check exactly. For the illustrative problem, the
positive sum is 134 and the negative is -14, leaving a net positive sum $\sum x'y'$
of 120. We now have everything we need for calculating r. Applying
formula (8.5), we have

$$
\begin{aligned}
r_{xy} &= \frac{\dfrac{120}{87} - (.23)(-.345)}{(1.52)(1.57)} \\
&= \frac{1.3793 + .0794}{2.3864} \\
&= \frac{1.4587}{2.3864} \\
&= .61
\end{aligned}
$$
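
The computation from grouped and coded data can be sketched in Python as follows. The cell frequencies of Table 8.5 are not reproduced here, so the small scatter diagram in the sketch is hypothetical; only the final check uses figures from the text (a cross-product sum of 120, N of 87, and the coded means and standard deviations given above). The function name is an assumption, and x' and y' are treated as coded interval values measured from an arbitrary origin.

```python
import math


def r_from_coded_cells(cells):
    """Formula (8.5). cells maps (x_code, y_code) to the cell frequency f."""
    n = sum(cells.values())
    sum_fx = sum(f * x for (x, _), f in cells.items())
    sum_fy = sum(f * y for (_, y), f in cells.items())
    sum_fx2 = sum(f * x * x for (x, _), f in cells.items())
    sum_fy2 = sum(f * y * y for (_, y), f in cells.items())
    sum_fxy = sum(f * x * y for (x, y), f in cells.items())
    mean_x = sum_fx / n
    mean_y = sum_fy / n
    sigma_x = math.sqrt(sum_fx2 / n - mean_x ** 2)
    sigma_y = math.sqrt(sum_fy2 / n - mean_y ** 2)
    return (sum_fxy / n - mean_x * mean_y) / (sigma_x * sigma_y)


# Hypothetical 3-by-3 scatter diagram, coded -1, 0, +1 on each scale.
cells = {(-1, -1): 3, (0, 0): 4, (1, 1): 3,
         (-1, 0): 1, (1, 0): 1, (0, 1): 1, (0, -1): 1}
r_coded = r_from_coded_cells(cells)

# Moving the arbitrary origin of the codes leaves r unchanged.
shifted = {(x + 2, y - 1): f for (x, y), f in cells.items()}
r_shifted = r_from_coded_cells(shifted)

# Check of the arithmetic in the text from its summary figures.
r_text = (120 / 87 - (0.23) * (-0.345)) / (1.52 * 1.57)
print(round(r_coded, 2), round(r_shifted, 2), round(r_text, 2))
```

That the shifted codes give the same coefficient illustrates why no decoding is needed: the correction term, the product of the two coded means, absorbs whatever origin was chosen.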
INTERPRETATIONS OF A COEFFICIENT OF CORRELATION


How High Is Any Given Coefficient of Correlation. Any coefficient of 
correlation that is not zero and that is also statistically significant denotes 
some degree of relationship between two variables. But we need further 
orientation on the matter, for the strength of relationship can be regarded 
from a number of points of view, and it is not correct from any one of these
points of view to say that the degree of relationship is exactly proportional
to r. The coefficient of correlation does not give directly anything like a percentage
of relationship. We cannot say that an r of .50 indicates two times
the relationship that is indicated by an r of .25. Nor can we say that an 
increase in correlation from r = .40 to r = .60 is equivalent to an increase in 
correlation from r = .70 to .90. The coefficient of correlation is an index 
number, not a measurement on a linear scale of equal units. 

A General Verbal Description of Coefficients. Our interpretation of the 
size of r depends very much upon what we propose to do with it or the reasons 
why we computed it. What would be a large correlation coefficient for one 
purpose would be regarded as a small one for another. Interpretation is 
therefore largely a relative matter, relative to the area of investigation in 
which we are working and to other factors. But taking correlations just at 
large, without particular regard to their use and as a general orientation, we 
may say that the strength of relationship can be described roughly as follows 
for various r's:


Less than .20 ........ Slight; almost negligible relationship
.20-.40 .............. Low correlation; definite but small relationship
.40-.70 .............. Moderate correlation; substantial relationship
.70-.90 .............. High correlation; marked relationship
.90-1.00 ............. Very high correlation; very dependable relationship


It should be said that the coefficients should be interpreted as stated only 
when, by comparison with the standard error of r, they prove to be significant. 
It should also be said that the same interpretations apply alike to negative 
and positive r's of the same numerical size. An r of -.60 indicates just as
close a relationship as an r of +.60. 
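
The general verbal descriptions can be put in the form of a small lookup, sketched below in Python. Two assumptions are made that the list itself does not settle: a coefficient falling exactly on a boundary (.20, .40, .70, .90) is assigned to the higher class, and the question of statistical significance is left to the user. The function name is hypothetical.

```python
def describe_r(r: float) -> str:
    """Rough verbal description of a coefficient, ignoring its sign."""
    size = abs(r)
    if size > 1.0:
        raise ValueError("a coefficient of correlation cannot exceed 1.00")
    if size < 0.20:
        return "slight; almost negligible relationship"
    if size < 0.40:
        return "low correlation; definite but small relationship"
    if size < 0.70:
        return "moderate correlation; substantial relationship"
    if size < 0.90:
        return "high correlation; marked relationship"
    return "very high correlation; very dependable relationship"


# Coefficients met earlier in the chapter.
for r in (0.76, 0.14, -0.69, 0.61):
    print(f"{r:+.2f}: {describe_r(r)}")
```

Such a lookup is only a general orientation; as the next paragraphs show, the use to which a coefficient is put may call for quite different standards.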

Particular Uses Have a Bearing on Interpretation of r. The general 
descriptive list just given should be qualified by making references to particu- 
lar uses of r. One common use is to indicate the agreement of scores on an 
aptitude test with measures of scholastic or of vocational success. Such a 
correlation is known as a validity coefficient. It is an index of the practical 
validity of a test. Chapter 18 will deal extensively with this subject. Common
experience shows that the validity coefficient for a single test may be
expected within the range from .00 to .60, with most of them in the lower half
of that range. Validity coefficients for composite scores based upon combinations
of several different kinds of tests are likely to be distinctly higher,
ranging up to .80 in rare instances but hardly ever above the latter figure.
Many who have employed tests for vocational guidance or vocational selection
have followed a tradition which may be credited to C. L. Hull,¹ some 30
years ago, that the minimum validity coefficient for a test of practical usefulness
is about .45. Recent experiences have shown that this standard is too
rigid and that there are many considerations other than validity which determine
the usefulness of a test in any given situation, as will be shown in
Chap. 15.

* For a treatment of the topic of statistical significance of a coefficient of correlation,
see Chap. 9.

It is well recognized that a reliability coefficient, which in very general terms
is a correlation of a test with itself, is usually a much higher figure than a
validity coefficient. Following the leadership of T. L. Kelley,² there has been
a general tradition that, to be sufficiently reliable for discriminating between
individuals, a test should have a reliability coefficient of at least .94. Some
have been more liberal in this regard, allowing a minimum of .90, while others
have been more demanding, with a requirement of a minimum of .96. These
standards are rarely attainable, and it is safe to say that most tests in use fail
to meet them. As a matter of fact, there are many very useful tests whose
reliability coefficients are in the .80's and even below. It is coming to be
recognized that validity is much more important than reliability, and, in fact,
it is possible for a test to be sufficiently valid for practical purposes without
being very reliable. Tests with reliability coefficients as low as .35 have been
found useful when utilized in batteries with other tests.³ Such tests have
been known with validities as high as .35. They could theoretically have
validities much higher than that. Reliability and validity depend upon
many considerations that we cannot go into here. These problems will be
treated in Chaps. 17 and 18. It is sufficient to say that one must be a
relativist when dealing with problems of test reliability and validity. The
student's interpretation of a coefficient of correlation, like his interpretation
of other statistics, is subject to considerable revision as he knows more about
its uses. While these qualifications mentioned regarding reliability and
validity need to be made, the fact remains that in practice we expect reliability
coefficients to be in the upper brackets of r values, usually .80 to .98,
and validity coefficients to be in the lower brackets, usually .00 to .80.

¹ Hull, C. L. Aptitude Testing. Yonkers, N.Y.: World, 1928. Chap. 8.
² Kelley, T. L. Interpretation of Educational Measurements. Yonkers, N.Y.: World.
³ Guilford, J. P. New standards for test evaluation. Educ. psychol. Measmt, 1946,
6, 427-428.

When one is investigating a purely theoretical problem, even very small
correlations, if statistically significant (undoubtedly not zero), are often very
indicative of a psychological law. Whenever a relationship between two 
variables is established beyond reasonable doubt, the fact that the correlation 
coefficient is small may merely mean that the measurement situation is con- 
taminated by many things uncontrolled or not held constant. One can 
readily conceive of an experimental situation in which, if all irrelevant factors 
had been held constant, the r might have been 1.00 rather than .20. For
example, the correlation between an ability score and scholarship is .50, since 
both are measured in a population whose scholarship is also allowed to be 
determined by effort, attitudes, marking peculiarities of the instructors, and 
what not. Were all the other determiners of scholarship held constant and 
were both aptitude and marks perfectly measured, the r would be 1.00 rather 
than .50. This line of reasoning indicates that where any correlation between
two things is established at all, and particularly where there is a causal rela- 
tionship involved, the fundamental law implies a perfect relationship. Thus, 
in nature, correlations of zero or 1.00 are the rule between variables when 
isolated. The fact that we obtain anything else is because of the inextricable 
interplay of variables that we cannot measure in isolation. 
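
The line of reasoning above can be illustrated with a toy model, sketched below in Python. Everything in it is an assumption made for illustration: scholarship is taken to be the simple sum of an ability score and one other determiner (called effort here) that is uncorrelated with ability, and the weight of the second determiner is chosen so that the observed correlation is .50. Holding the second determiner constant then restores a correlation of 1.00.

```python
import math


def pearson_r(xs, ys):
    """Product-moment r computed from deviations about the two means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_x2 = sum((x - mean_x) ** 2 for x in xs)
    sum_y2 = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)


# Hypothetical scores in deviation form for four individuals.
ability = [-1, -1, 1, 1]
effort = [-1, 1, -1, 1]          # uncorrelated with ability by construction
effort_weight = math.sqrt(3)     # chosen so that r comes out at .50

# Scholarship determined by ability and by effort together.
scholarship_mixed = [a + effort_weight * e for a, e in zip(ability, effort)]

# Scholarship with effort held constant: ability alone determines it.
scholarship_isolated = list(ability)

r_mixed = pearson_r(ability, scholarship_mixed)
r_isolated = pearson_r(ability, scholarship_isolated)
print(round(r_mixed, 2), round(r_isolated, 2))
```

In this model r equals 1 divided by the square root of 1 plus the squared weight of the uncontrolled determiner, so the heavier the uncontrolled influences, the smaller the observed coefficient, although the underlying relation of ability to scholarship is unchanged.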

The practical conclusion from this is that a correlation is always relative to 
the situation under which it is obtained, and its size does not represent any abso- 
lute natural or cosmic fact. To speak of the correlation between intelligence 
and scholarship is absurd. One needs to say which intelligence, measured 
under what circumstances, in what population, and to say what kind of 
scholarship, measured by what instruments, or judged by what standards. 
Always, the coefficient of correlation is purely relative to the circumstances under 
which it was obtained and should be interpreted in the light of those circumstances, 
very rarely, certainly, in any absolute sense. 

How much faith one should place in any relationship shown by a coefficient 
of correlation also depends upon the urgency of the outcome. There are 
probably many medical treatments, such as some inoculations, vaccines, and 
the like, concerning which the knowledge is rather incomplete, which are 
administered even though the correlation between the treatment and living 
(or between nontreatment and dying) is of the order of .10 to .20. Although
the probabilities of living may be increased by only 1 per cent by the treat- 
ment, the saving of 1 life in 100 is regarded as worth the effort. If a pro- 
cedure in education promised only 1 per cent improvement over guesswork, 
we should pay little attention to it, because the seriousness of the outcome 
would not justify the means. It may be said in passing, however, that fail- 
ures to predict in vocational and educational practice are more generally 
recognized by reason of correlational checkup than are failures to predict in 
medical practice, where correlational checkup is less often made. In addition 
to the difference in relative seriousness of the outcomes of prescription in the 
two cases, this factor of better knowledge of goodness of results may be an 
important reason for the higher standards of prescriptive accuracy demanded
in education than are sometimes required in other fields. 

GRAPHIC REPRESENTATIONS OF CORRELATIONS 


In presenting the facts of correlation to the layman, who is probably not 
accustomed to thinking in terms of numerical indices in any case and who has 
probably never learned of the coefficient of correlation, it is better to convey 
Fig. 8.5. Correlation between the pilot-aptitude score (pilot stanine) and the criterion of
graduation-elimination from flying training in the AAF illustrated by a bar diagram.
(Based upon Stanines: Selection and Classification for Air Crew Duty. Washington, D.C.:
Headquarters, Army Air Force, 1946.)


students in each stanine group is given for those who have some appreciation 
of the stability offered by large samples. 

The other diagram, Fig. 8.6, shows the average rating of flying proficiency 
made by cadets at each stanine level, and only the average. Some investi- 
gators connect successive pairs of points with lines, but in this particular 
instance the linear trend is so clear that a straight line has been drawn by
inspection to fit the trend. It is assumed that minor deviations that occur 
are due to sampling errors. A warning should be given in connection with 
this type of figure. It can give an impression of degree of correlation far in 
excess of that justified. Not shown are the widths of dispersions of indi- 
viduals, at different stanine levels, in this case. While the averages of col- 
umns do not deviate much from a straight line, many individual cases may 
deviate considerably. There are ways of representing average discrepancies
of individuals from such a regression line (see Chap. 15) which could be used
to give the reader some idea of their seriousness.

Fig. 8.6. Correlation between pilot-aptitude scores and instructors' ratings of flying pro-
ficiency illustrated by means of a regression line that is based upon the averages of ratings
for different aptitude-score levels.


ASSUMPTIONS UNDERLYING THE PRODUCT-MOMENT CORRELATION


The student should be warned, before leaving this chapter, concerning the 
restrictions that should be observed in the use of the Pearson coefficient of 
correlation. The most important requirement for the legitimate use of the 
Pearson r is that the trend of relationship between Y and X be rectilinear, in
other words, a straight-line regression. This can be determined, as a rule, by
inspection of the scatter diagram. If the distribution of the cases within the 
correlation diagram appears to be elliptical, without any indications of a 
decided bending of the ellipse, the chances are that the relationship is recti- 
linear. Even if it is not, the deviation from a straight-line relationship may 
be so slight that we may assume rectilinearity as a first approximation, and
the degree of correlation indicated by r will be fairly close to any index of
correlation, such as the correlation ratio (see Chap. 13), that is applied when
there is curvature in the trend. When there is an obvious bending of the
distribution of cases, a correlation ratio, or some other special coefficient, is 
indicated as the best index of correlation. 

There are in educational and psychological measurements certain factors 
that produce artificially curved scatters in the correlation diagram. This
may happen when one or both distributions taken alone are badly skewed and
the skewing is produced artificially by the faulty measuring scale, with its
systematically shifting unit of measurement. If there is good reason to
believe that this may be the case, one solution would be to normalize the
skewed distribution by methods described in Chap. 19. When distributions
are corrected for skewness, the curvature in the regression is frequently 
eliminated, and linearity is then obtained. If curvature still remains, then 
the Pearson r is not to be used to indicate the amount of correlation. 

There is nothing in what has been said to demand that the Pearson r is to be 
computed only with normal distributions. The forms of distributions may
be various, so long as they are fairly symmetrical and unimodal; even rec-
tangular ones would do. The important consideration is whether in all
columns the dispersions are approximately equal, as indicated by the column
standard deviations, and also in all rows. This condition goes by the name
homoscedasticity. When columns (and rows) are relatively homoscedastic,
we may compute a Pearson r. This condition will prevail generally when the
two distributions are fairly symmetrical within themselves; thus we need not
go so far as to compute standard deviations of columns and rows in order to
find out. It is when distributions are markedly skewed that significant
departures from homoscedasticity occur. 
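
The check on column dispersions described above can be sketched in Python. This is a minimal illustration only, not a procedure given in the text: the function name `column_sds`, the class-interval width of 10, and the ten pairs of scores are all hypothetical.

```python
from collections import defaultdict
from statistics import pstdev


def column_sds(pairs, width):
    """Group (x, y) pairs into columns of X that are `width` units wide and
    return the standard deviation of Y within each column."""
    columns = defaultdict(list)
    for x, y in pairs:
        lower = (x // width) * width  # lower limit of the X class interval
        columns[lower].append(y)
    # pstdev divides by the number of cases, as a descriptive SD does
    return {lower: pstdev(ys) for lower, ys in sorted(columns.items())}


# Hypothetical scores, invented for illustration only.
pairs = [(11, 20), (13, 24), (17, 22), (22, 30), (25, 34),
         (28, 32), (31, 40), (34, 44), (38, 42), (39, 46)]

sds = column_sds(pairs, width=10)
for lower, sd in sds.items():
    print(f"X {lower}-{lower + 9}: SD of Y = {sd:.2f}")
```

Column standard deviations of roughly equal size (and, with X and Y interchanged, row standard deviations of roughly equal size) would be consistent with homoscedasticity.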

Figure 8.7 is presented to show graphically the kind of scatter plots one 
might expect when one or both distributions are symmetrical or skewed. In 
each diagram the form of distribution assumed is shown along the X or Y 
dimension. In diagram A both distributions are assumed to be normal. 
The probable scatter of the cases within the square area is elliptical. The 


not homoscedastic in either vertical or horizontal arrays ("array" is a general
term including both rows and columns). In diagram C, with skewing in the
same direction in both X and Y distributions, the regression appears to be
rectilinear but the dispersion is not homoscedastic. In diagram D, the
skewing is in opposite directions and there is neither rectilinearity nor homo-
scedasticity. Only in the case of diagram A would one justifiably compute a
Pearson product-moment coefficient of correlation. In a later chapter
(Chap. 13) other types of coefficients of correlation will be described which
might be applied to the data in diagrams B, C, and D if one could justify the
appropriate assumptions that must be made.

1 Some writers suggest that only when both distributions are normal or nearly so will
the conditions be fully satisfied for computing a Pearson r. In practice probably no one

Fig. 8.7. Hypothetical forms of scatter plots in a correlation diagram when the forms of
distribution of X and Y values differ. Diagram A shows linear regression and homoscedas-
ticity; B and D show curved regression and lack of homoscedasticity; and C shows linear
regression but lack of homoscedasticity.


Exercises 

1. Using the first 10 pairs of scores in the list in Data 8A, compute a Pearson r between
parts I and II. Use formulas (8.1) and (8.2). Find a similar coefficient, using the last
10 pairs of scores in the same two variables. State your conclusions.

2. Correlate the first 10 pairs of scores for parts II and III, using formulas (8.3) and (8.4).
Correlate the same two parts, using the last 10 pairs and the same formulas. State your
conclusions.

3. Prepare a scatter diagram for the correlation of parts III and IV, including all 40 cases.
Compute a Pearson r, using formula (8.5). State conclusions. 

DATA 8A. SCORES EARNED BY 40 HIGH-SCHOOL STUDENTS IN SEVEN PARTS OF
THE GUILFORD-ZIMMERMAN APTITUDE SURVEY*


Part I        Part II    Part III     Part IV      Part V        Part VI         Part VII
Verbal Com-   Reason-    Numerical    Perceptual   Spatial       Spatial         Mechanical
prehension    ing        Operations   Speed        Orientation   Visualization   Knowledge
22 11 24 29 27 39 30 
8 5 22 40 16 23 21 
19 6 44 36 14 12 21 
32 8 72 32 21 20 33 
13 2 25 46 25 20 29 
24 5 30 47 2 6 8 
22 4 38 49 15 37 35 
35 1 54 53 34 28 16 
18 7 37 51 37 46 30 
13 10 61 50 38 46 35 
53 23 56 45 22 41 38 
15 9 42 48 18 5 18 
34 18 30 25 40 58 46 
15 2 42 48 12 21 17 
27 4 28 28 31 26 24 
19 9 32 40 11 13 19 
29 4 24 37 26 0 27 
V 
2 4 23 3 
16 5 42 44 29 24 34 
56 12 67 48 20 40 26 
22 5 58 48 28 41 20 
32 4 57 33 20 4 16 
18 8 49 4T 19 36 42 
24 15 87 52 36 34 26 
22 12 14 48 25 16 27 
22 10 38 46 21 0 20 
21 21 32 33 11 43 37 
13 10 52 40 29 35 11 
ay 3 60 49 43 13 37 
2 10 29 49 10 21 27 
20 4 50 55 22 8 27 
25 11 76 43 26 20 26 
14 6 40 38 35 8 46 
п 2 32 56 37 4 26 
2 9 61 45 20 
38 17 56 67 25 20 5 
16 6 61 42 29 23 21 
14 4 17 44 26 7 21 
23 25 61 48 23 29 16 


* Part I is a vocabulary test; part II is composed of arithmetic reason

4. Do the same as in Exercise 3 for parts V and VI, or any other pair of parts. How
many pairs of coefficients of correlation are possible with Data 8A? State a general rule
for the number of intercorrelations when there are n variables.

5. Compute the Pearson r for Data 8B. Interpret your findings. 


DATA 8B. A SCATTER DIAGRAM OF REACTION-TIME MEASUREMENTS AND GRADES
EARNED IN GENERAL PSYCHOLOGY 


Reaction time
(auditory)   | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99
.180-.189 1
.170-.179 1
.160-.169 2 1 1 1
.150-.159 1 1 1
.140-.149 1 1 2 1 1 1
.130-.139 1 e
.120-.129 e 1
.110-.119 2 а 15g
.100-.109 1


6. Find five Pearson r coefficients reported in the literature. Tell what variables were
being correlated in each case. Interpret the results. Are the coefficients about the sizes 
you would have expected for the things correlated? Were there any special conditions 
that may have biased the amount of correlation in one way or another? 


Answers 


1. The seven parts of the Aptitude Survey were designed to measure different abilities
that are relatively independent, and hence to correlate low with one another. The correla-
tion r12 (between parts I and II) is found to be -.16 and +.47 in the first and last 10 pairs of
scores, respectively. (Incidentally, this somewhat large discrepancy shows how widely the
correlation between the same two variables can fluctuate from sample to sample, when
samples are very small.) The correlation for all 40 pairs is +.37. Typical correlations in
larger samples have been .25, .57, and .40, for college men, high-school boys, and high-school
girls, respectively.

2. r23 (parts II and III): .18 and .49. In larger samples (the same as in answer to Prob.
1) r23 was .18, .37, and .33.

3. r34 = .25. In larger samples it was .20, .07, and .31.

4. r56 = .27. In larger samples it was .61, .34, and .46. The number of pairs of vari-
ables equals n(n - 1)/2.

5. r = -.075 between reaction time and grades in psychology.
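
The computation behind answer 1 can be sketched in Python for the first 10 pairs of scores on parts I and II of the first data table. This is an illustrative sketch only: it uses the deviation form of the product-moment formula rather than the chapter's computing formulas (8.1) to (8.5), which are not reproduced here, and the function names `pearson_r` and `n_pairs` are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / sqrt(sum_xx * sum_yy)


def n_pairs(n):
    """Number of distinct intercorrelations among n variables: n(n - 1)/2."""
    return n * (n - 1) // 2


# First 10 pairs of scores on parts I and II, copied from the data table.
part_1 = [22, 8, 19, 32, 13, 24, 22, 35, 18, 13]
part_2 = [11, 5, 6, 8, 2, 5, 4, 1, 7, 10]

print(round(pearson_r(part_1, part_2), 2))  # -0.16
print(n_pairs(7))                           # 21
```

With seven parts in the survey, the rule in answer 4 gives 21 possible intercorrelations.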


For additional information on intercorrelations of these tests, see Michael, W. B., 
Zimmerman, W. S., and Guilford, J. P. An investigation of the nature of the spatial-rela- 
tions and visualization factors in two high-school samples. Educ. psychol. Measmt., 1951, 
11, 561-577. 


CHAPTER 9 


THE RELIABILITY AND SIGNIFICANCE OF STATISTICS 


In this chapter we raise the very important question as to how near the 
“truth” are statistical answers such as means, standard deviations, propor- 
tions, and the like. As was said before, any measured sample is usually 
employed to represent a larger population. A population, from the statistical 
point of view, is any arbitrarily defined group. The term will be more fully 
explained in later paragraphs. 

Our sampling has to be limited for practical reasons; we cannot measure 
total populations, or at least it is generally inefficient and unnecessary to do 
so. Yet we usually wish to generalize beyond our sample, arriving at scien- 
tific decisions that transcend the observations made at a particular time and 
in a particular place, or reaching administrative decisions that apply to 
larger groups of individuals. In preceding chapters we have been concerned 
with descriptive statistics only. The computed values were used to describe
the properties of particular samples. If we want to apply those same
descriptive statistics beyond the limits of samples, we must know how much
risk of being wrong we take. In general terms, the statistics stressed in this
chapter are designed to do that very thing. They are known as sampling
statistics.

To be more specific, when we obtain the mean of a sample that is measured
in some respect, before we say that this obtained mean also describes the cen- 
tral value of the population sampled, we need to find some basis for believing 
that it does not deviate very far from the population mean. Fortunately, 
there is a statistical procedure that will inform us about how far our obtained 
mean probably deviates from the population mean, provided certain condi- 
tions, to be explained later, have been satisfied. The statistic that will do 
this is known as the standard error of the mean. In a similar manner, there are
standard errors of other sample statistics—medians, standard deviations, 
proportions, correlation coefficients, and the like—which inform us of the 
accuracy of our obtained figures as estimates of the corresponding population 
values. 

SOME PRINCIPLES OF SAMPLING

Before going into the treatment of sampling statistics, it is necessary to
have clearly in mind the essential facts about the process of sampling. The
application of sampling statistics depends upon certain conditions of sampling.
If these are not satisfied, standard errors, no matter how accurately com- 
puted, may give wrong impressions. At best, they give us only estimates 
from which we can make decisions and draw conclusions, never with complete 
conviction but with various degrees of assurance. After making this frank 
confession as to the limitations of sampling statistics, it should also be asserted 
that without them we can hardly draw any generalized conclusions at all 
that would be of scientific or practical value. 

Populations and Samples. It is time that we had a better definition of 
population. Some statisticians call it universe. In any case, the statistician’s 
idea of population is quite different from the popular idea. Rarely would any 
statistical study regard the entire population of a nation, a city, or of some 
geographical region as its universe. 

The population in a statistical investigation is always arbitrarily defined by 
naming its unique properties. It might be the entering freshman class in a 
certain university, or the part of the freshman class entering a certain college 
or even a certain course. It might be the male sixteen-year-olds in a given 
school district; the children of Mexican parentage in a certain city; or the 
registered Democratic voters in the New England states. All these examples 
are of groups of human individuals. Populations could, of course, be defined 
as species, or phyla, or order of animals or of plants. 

There are also populations of observations or of reactions of a certain kind:
simple reactions to sound stimuli, word-association reactions, judgments of 
pleasantness of colors, and the like, from the psychological laboratory. It is 
probably the nonhuman groups that have seemed to require the more general 
term universe as an alternative to the more restricted term population. In 
this volume we shall use the term population in the broad sense to include all 
sets of individuals, objects, or reactions that can be described as having a 
unique pattern of qualities. 

Parameters and Statistics. If we were to measure all the individuals of a 
population and actually to compute the indices of central value, dispersion, 
and correlation, as we ordinarily do for samples, we should obtain what the 
statistician calls parameters. The population parameters exist whether we 
compute them or not, if we ignore dynamic changes that may be occurring 
and assume for practical purposes that these parameters are fixed, at least 
for a time. 

Figure 9.1 illustrates the distinction between population parameters and
sample statistics. The larger distribution is that of the entire population.
The smaller distribution is of a sample drawn at random from that population.
The population parameters, mean and standard deviation, are symbolized by
M and σ, each with a bar over it.¹ It will be noted that in this particular
sample the mean (M) and the standard deviation (σ) do not coincide exactly
in size with their corresponding parameters (M̄ and σ̄). This is character-
istic. A second sample would be expected to have still different M and σ, but
also similar to M̄ and σ̄ in size.
The same sort of parallel could be illustrated with respect to proportions
(p̄ and p), semi-interquartile ranges (Q̄ and Q), and coefficients of correlation
(r̄ and r). By careful and adequate sampling we hope to arrive at statistics
that will approximate the corresponding parameters very closely. By the
use of standard errors and other sampling statistics, to be discussed later, we
estimate how far our obtained statistics may have deviated from their corre-
sponding parameters.

Fig. 9.1. A comparison of a population distribution and a sample distribution, also of
population parameters and sample statistics.

Random Sampling. It should be kept in mind that the use of sampling
statistics (standard errors and the like) rests on the assumption that the
sampling has been random. The best definition of random sampling is that
it is selection of cases from the population in such a manner that every indi-
vidual in the population has an equal chance of being chosen. The selection of
any one individual is also in no way tied to the selection of any other. This calls
to mind a well-conducted lottery, selective-service numbers, coin tossing,
throwing dice, and other operations that allow the "laws of chance" to
operate freely.

1 The bar over a quantity often indicates "the mean of." For example, X̄ is sometimes
used to indicate "the mean of X." Some writers on statistics use the Greek letters μ and σ
to stand for the population parameters, mean and SD, respectively, and Roman letters
M and s to stand for sample statistics. The use of M̄ and σ̄ for the parameters is consistent
with one operational conception, to the effect that these parameters are the means of a
very large number of sample M's and σ's.

There are several ways of favoring random sampling from populations.
For a population of individuals, if all members are arranged in alphabetical 
order and one wishes to draw one person in every hundred, the first case 
might be taken by blind pointing within the first hundred names and every 
hundredth one following in the list automatically chosen. Tables of random 
numbers have been published as an aid in random sampling. The numbers 
themselves have been placed in sequence by some kind of lottery procedure. 
If individuals in a population are numbered in sequence and thus identified 
by number, selections can be made by following the random numbers in any 
systematic way. A random sample should be fairly representative of the
population, though any particular sample, especially a small one, may
by chance not be so representative as we would like.
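
The two procedures just described can be sketched in Python, with the standard `random` module standing in for a table of random numbers. This is a rough illustration under stated assumptions: the population of 1,000 numbered individuals, the seed, and the function names `simple_random_sample` and `systematic_sample` are all hypothetical.

```python
import random


def simple_random_sample(population, k, rng):
    """Draw k distinct cases so that every case has an equal chance of selection."""
    return rng.sample(population, k)


def systematic_sample(population, interval, rng):
    """Pick a random starting case within the first `interval` cases
    (the "blind pointing"), then take every `interval`-th case after it."""
    start = rng.randrange(interval)
    return population[start::interval]


# Hypothetical population: 1,000 individuals identified by number.
population = list(range(1, 1001))
rng = random.Random(1956)  # fixed seed so that the sketch is repeatable

random_draw = simple_random_sample(population, 10, rng)
every_hundredth = systematic_sample(population, 100, rng)

print(sorted(random_draw))
print(every_hundredth)
```

Both calls return 10 cases from the 1,000. In the systematic draw only the starting point is left to chance; every later case follows automatically from it.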

Biased Sampling. In a biased sample there is a systematic error. Certain 
types of cases have an advantage over others in being selected. The likeli- 
hood of individuals being chosen differs from one to another. A common 
example of this in educational research is the voluntary return of question- 
naires. The names of those who are to receive the questionnaires may, to be 
sure, be randomly chosen from a much larger group. But suppose that only 
60 per cent of those circularized return the questionnaires, which is not an 
atypical event. The 60 per cent who do return the data might possibly be 
representative, but there is a strong presumption that in the decision to return 
or not to return the instrument there is room for biasing forces to work. 
Those forces may or may not be relevant to the content of the questionnaire 
itself. But if the information requested implies favorable or unfavorable 
facts about the respondent, his associates, or his work, it is quite natural to 
expect that those with a “good” showing will be more inclined to reply than 
those with a “bad” showing. If the trait of cooperativeness or of responsi- 
bility or of dependability of the respondent is involved in the data or even 
correlated with something wanted in the data, there is also a strong likelihood 
of bias. 

A colossal example of biased sampling is that of the Literary Digest public- 
opinion poll during the 1936 presidential campaign. Several million post- 
card ballots were said to have been circulated, certainly anticipating a sample 
of most generous size. But the mailing lists were made up from telephone 
directories and automobile registration lists. It so happened that in the poll 
the telephone subscribers and car owners voted with a majority in favor of 
the candidate who lost, while the non-telephone subscribers and non-car 
owners voted at the polls in a more decisive way for the successful candidate. 
Among those who received post-card ballots there was also probably a selec- 
tion as to which ones would be most likely to take the trouble to return the
card. Those who were most discontented with things as they were and
wanted a change would take the trouble to register a protest straw vote.
Those who were contented or who felt somewhat secure as to the outcome
would be less likely to return the card. This would also tend to make the
vote appear to favor the losing candidate, who was running against an
incumbent.

1 Examples are Tippett, L. H. C. Random Sampling Numbers. New York: Cambridge,
1927; and Lindquist, E. F. Statistical Analysis in Educational Research. Boston: Hough-
ton Mifflin, 1940. Table 18.

The scientific investigator must be eternally vigilant to the possibility of 
biased sampling. A good, systematic control of experimental conditions is 
designed to prevent biased samples or to make known their effects. Where 
there is less than customary experimental control of the observations, every 
possible effort should be made to know the conditions under which the data 
are obtained. Thorough knowledge of the conditions should be a basis for 
deciding whether selection of cases has been biased. Knowledge of condi- 
tions is also essential for the sake of accurate definition of the population 
sampled. 

Stratification in Sampling. One common procedure that is introduced in 
sampling to help to prevent biases and also to assure a more representative 
sample is known as stratification. Stratification is a step in the direction of 
experimental control. It operates with subgroups of more homogeneous 
composition within the larger population. 

A very common example is to be found in public-opinion-polling practices. 
Suppose the issue to be investigated is public attitude toward a certain piece 
of labor legislation. It is quite likely that people in the two major political 
parties would tend to lean in opposite directions on such an issue. It is prob- 
able that people of different socioeconomic categories—professional, business, 
office worker, semiskilled laborer, and unskilled laborer—would react with 
some systematic differences on the issue. It is possible, though not so likely, 
that individuals of the two sexes would tend to respond somewhat differently. 
Other divisions of the population, such as rural versus urban, regional, and 
educational groups, might also show systematic differences on the issue. In 
other words, subgroups of the population are considered with respect to any 
variable that is suspected of correlating appreciably with the variable being 
studied. It does not matter that some of the variables are themselves inter- 
correlated unless such an intercorrelation is very high, in which case it would 
be superfluous to control selection of samples on both of two variables so 
closely related. 

Having decided which variables are important in sampling, the entire
population is studied to see what proportions fall into each category, i.e.,
what proportions are Democrat or Republican; male or female; urban or 
rural; in each socioeconomic group; and so on. Any sample to be obtained, 
then, should have proportional representations from all subgroups. Within 
each defined subpopulation, for example, a male, professional, Republican, 
New England group, random sampling may then be carried out. Random 
selection of cases would also be made within each of the other defined sub-
populations in appropriate numbers. The total sampling procedure here 
described has been called stratified-random sampling. 
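
Proportional representation followed by random selection within each subgroup can be sketched in Python as follows. This is a minimal sketch, not a polling procedure taken from the text: the population of 10,000 voters split 60/40 by party, the case labels, and the function name `stratified_random_sample` are all hypothetical, and only one stratifying variable is used.

```python
import random


def stratified_random_sample(strata, total_n, rng):
    """Draw from each stratum a random sample whose size is proportional
    to that stratum's share of the whole population."""
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounded quotas may not add up exactly to total_n in every case.
        quota = round(total_n * len(members) / population_size)
        sample[name] = rng.sample(members, quota)
    return sample


# Hypothetical population of 10,000 voters, 60 per cent Democrats and
# 40 per cent Republicans, each identified by a label.
strata = {
    "Democrat": [f"D{i}" for i in range(6000)],
    "Republican": [f"R{i}" for i in range(4000)],
}

rng = random.Random(0)
sample = stratified_random_sample(strata, total_n=100, rng=rng)
print({name: len(cases) for name, cases in sample.items()})
# {'Democrat': 60, 'Republican': 40}
```

With several stratifying variables, each combination of categories would be treated as one stratum in the same way.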

The importance of the proportional-representation principle and its advan- 
tage over a purely random sampling can be readily demonstrated. Suppose 
that 55 per cent of the Republicans and 45 per cent of the Democrats are in 
favor of a certain labor bill. In the general population let us assume that 
60 per cent are registered Democrats and 40 per cent are registered Republi- 
cans. In a random sample of 100 voters one would expect in the long run to
draw the two party representatives in about the same ratio, 60/40. This 
would vary from sample to sample, however, even to the extent that the 
majority could be reversed; for example, it could even be 45/55. In the 
typical polling sample we should expect a majority of voters against the bill. 
If the sample should by chance contain a majority of Republicans, however, 
the majority might favor the bill. If stratification were applied, we should 
be sure to have in the sample the ratio 60/40, and with this restriction 
imposed upon the random sampling we should expect the general population 
sentiment to be more accurately reflected. Thus it can be seen that a strati- 
fied-random sample is likely to be more representative of a total population 
than is a purely random sample. 
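
The arithmetic behind this demonstration can be checked with a short Python sketch. The percentages are those assumed in the paragraph above; the function name `per_cent_in_favor` is hypothetical.

```python
def per_cent_in_favor(n_democrats, n_republicans,
                      dem_favor=0.45, rep_favor=0.55):
    """Expected percentage of a sample favoring the bill, given its party makeup."""
    total = n_democrats + n_republicans
    in_favor = n_democrats * dem_favor + n_republicans * rep_favor
    return 100 * in_favor / total


print(round(per_cent_in_favor(60, 40), 1))  # 49.0
print(round(per_cent_in_favor(45, 55), 1))  # 50.5
```

A sample with the population ratio of 60/40 is expected to show 49 per cent in favor, a majority against the bill, whereas a chance sample of 45/55 is expected to show 50.5 per cent in favor, reversing the majority.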

Purposive Samples. A purposive sample is one arbitrarily selected because 
there is good evidence that it is very representative of the total population. 
Experience has shown in public-opinion polling that there are certain states 
or regions that come close to national opinion time after time. If one is 
willing to depend upon this experience, one may use the limited population 
as the source of the sample to use as a “barometer” for the total population. 
This is a convenient procedure, but it has the disadvantage that much prior 
information must have been obtained. There is also a risk that conditions 
may change to the extent that the particular segment of population no 
longer represents the total or does not represent it on some new issue. 

Incidental Samples. The term incidental sample is applied to those samples 
that are taken because they are the most available. Many a study has 
been made in psychology with students in classes of beginning psychology 
as the samples merely because they are most convenient. Results thus 
obtained can be generalized beyond such groups only with considerable risk.

Generalizations beyond any sample can be made safely only when we 
have defined the population that the sample represents in every significant 
detail. If we know the significant properties of the incidental sample well 
enough and can show that those properties apply to new individuals, those 
new individuals may be said to belong to the same population as the members
of the sample. By "significant properties" is meant those variables that
correlate with the experimental variables involved. They are the kind of
properties considered above in connection with stratification of samples.
It is unlikely that membership in a political party would have much bearing
upon the results of certain experiments performed upon sophomores in a
beginning psychology course, but such variables as age, education, social
background, and the like may definitely be pertinent.

1 Such a sample is often called "accidental." In no real sense is the sample an accident;
it was selected. It would be an "accident," of course, if the sample represented usefully a
population in which we want to make predictions of parameters.

Much depends upon the experimental variable under study; whether 
it is a motor skill or a social attitude, a suggestible reaction or an interest- 
test score. If incidental samples are employed, the investigator is under 
scientific obligation to describe the properties of his group in all aspects that 
he can conceive as being related to the outcome of the investigation. 


THE RELIABILITY OF AVERAGES


The Distribution of Means of Samples. Suppose that we are dealing 
with a population whose mean (M̄) is 50.0 and whose standard deviation
(σ̄) is 10.0 on the measuring scale we are using. Such a distribution is
illustrated by the top diagram in Fig. 9.2. We do not know these popu- 
lation parameters ordinarily, but for the sake of an illustration we shall 
assume that we do know them here. 

Sampling Distributions. Suppose, next, that we proceed to draw ran- 
dom samples, all of equal size, one at a time, from this population. To 
satisfy the conditions of random sampling in a strictly mathematical sense, 
we should replace each sample drawn, after noting the value of each of its 
members, before drawing the next sample. Each individual should have 
an equal opportunity of being selected in every sample. Having lost one 
sample, the population is different from what it was originally. When the 
population is very large, as compared with the size of sample, however, we 
can forget about this replacement requirement for practical purposes. In this 
case, one sample would "hardly be missed"; that is, its loss would change
the chance conditions to an inconsequential degree. We shall find, later, 
that when the size of sample is not decidedly smaller than the population, 
it is possible to make allowance for this fact. 

To take a specific example of random sampling, with the same population 
described above in mind, let the size of sample be 25. The sample mean 
will not only differ from sample to sample but will also usually deviate from 
the population parameter (in this example, the mean of 50.0). If we have a
number of such sample means, we may treat them just as if each were a 
single observation and set up a frequency distribution of them. This is 
known as a sampling distribution. Such a frequency distribution will be 
close to the normal form when the population distribution is not seriously 
skewed and when N is not small (i.e., not less than 30).

Normality of distribution of single cases in the total population favors
normal distribution of means and of other statistics computed from samples
drawn from that population. Even when the population distribution departs
from normality, however, the distribution of means of samples drawn from
it tends to be normal, unless the samples are too small. The smaller the sample, the more
does the form of distribution of the population affect the form of distribution
of the means. The extreme case would be samples of only one case each,
in which event we should expect the distribution of means (if means of one
observation each have any real meaning) to be of the same form as that of
the population.

[Figure 9.2 panels: distribution of individual measures for a whole population (standard deviation of 10); distributions of means for samples of 1, 2, 3, 4, 16, and 25 cases each.]

Fig. 9.2. Showing the hypothetical decrease in variability or fluctuation of the means of
samples as we increase the size of the sample drawn at random from a large population.
(Modified from Lindquist, A First Course in Statistics. Houghton Mifflin. By permission.)
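
The shrinking spread of sample means as samples grow larger can be imitated with a small simulation in Python. This is a sketch under stated assumptions, not a reproduction of the figure: the population of 20,000 normally distributed scores, the 2,000 samples per sample size, the seed, and the function name `sample_means` are all hypothetical.

```python
import random
from statistics import mean, pstdev


def sample_means(population, sample_size, n_samples, rng):
    """Means of repeated random samples. rng.choices draws with replacement,
    so every individual has the same chance of selection on every draw."""
    return [mean(rng.choices(population, k=sample_size))
            for _ in range(n_samples)]


rng = random.Random(9)
# Hypothetical population: 20,000 scores with mean near 50 and SD near 10.
population = [rng.gauss(50.0, 10.0) for _ in range(20000)]

spread = {}
for size in (1, 4, 16, 25):
    means = sample_means(population, size, n_samples=2000, rng=rng)
    spread[size] = pstdev(means)
    print(f"N = {size:2d}: SD of sample means = {spread[size]:.2f}")
```

The standard deviation of the sample means should fall steadily as the size of sample rises from 1 to 25 cases, and a frequency distribution of each set of means could be set up just as for single observations.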

A knowledge of the form of sampling distribution of a statistic is very 
important. Our ability to draw conclusions known technically as statisti- 
cal inferences depends upon knowing the form of distribution of sample
statistics. Without knowledge of the form of sampling distribution, many 
a scientific result would remain inconclusive. The reasons for this will be 
clearer as we go into the subject of interpretation of standard errors. 

The Standard Error of a Mean. At this stage of getting acquainted 
with sampling distributions, we are most interested in the dispersion of 
statistics, in this case, the dispersion of sample means. The reason is that 
the amount of this dispersion gives us the clue as to how far such sample 
means may be expected to depart from the population mean. If we are 
to use a sample mean as an estimate of the population mean, any deviation 
of such a sample mean from the population mean may be regarded as an 
error of estimation. The standard error of a mean tells us how large these 
errors of estimation are in any particular sampling situation. The standard 
error of a mean is a standard deviation of the distribution of sample means. 
To distinguish such a standard deviation from the more familiar one that 
applies to dispersions of individual observations, we call it a standard error. 
In later discussions it may be referred to by use of the abbreviation SE.

In order actually to compute the standard error of a mean, we need two
items of information: the population parameter $\bar{\sigma}$ and the size of sample N.
Since we do not ordinarily know $\bar{\sigma}$, it would seem that we could but rarely,
indeed very rarely, compute this standard error. There are satisfactory
ways of estimating it, however, as we shall see later. The formula for
computing the standard error of a mean is

$$\bar{\sigma}_M = \frac{\bar{\sigma}}{\sqrt{N}} \qquad \text{(Standard error of an arithmetic mean computed from a known population parameter)} \tag{9.1}$$

where $\bar{\sigma}$ = standard deviation of the population and N = number of cases
in the sample (not the number of means in the distribution of means).

Sample Size and the Standard Error of a Mean. The standard error of 
the mean is therefore directly proportional to the standard deviation of the 
population and inversely proportional to the size of the sample. More 
precisely stated, $\bar{\sigma}_M$ is inversely proportional to the square root of the size
of sample. As the individuals of a population scatter more widely, so will 
the means of samples drawn from that population also scatter more widely. 
But as we include more individuals in each sample drawn, the less widely 
can the means scatter from their central value. In the limiting case, if the 
sample includes the entire population, the deviation of the sample mean 
from the population mean can then be only zero, and $\bar{\sigma}_M$ is zero.

In Fig. 9.2 are shown graphically several instances of samples when N 
varies. The smallest possible sample occurs when N = 1. The mean of
each sample is then identical with the individual's measurement in that
sample. The dispersion of such means is as great as the dispersion of the
total population; $\bar{\sigma}_M$ then equals $\bar{\sigma}$, which we have assumed to be 10. When
each sample contains two cases, $\bar{\sigma}_M = 10/\sqrt{2} = 7.07$; when each sample
contains four cases, $\bar{\sigma}_M = 10/\sqrt{4} = 5$; and so on. The remaining cases in
Fig. 9.2 should now speak for themselves.
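
The arithmetic for Fig. 9.2 can be checked with a short sketch. It applies formula (9.1) with the assumed population standard deviation of 10 and compares each result with the standard deviation of many simulated sample means. The normal population with a mean of 50, the function names, and the number of simulated samples are assumptions made only for this illustration.

```python
import math
import random
import statistics

POP_SD = 10.0  # population standard deviation assumed in Fig. 9.2


def standard_error(pop_sd: float, n: int) -> float:
    """Formula (9.1): population SD divided by the square root of N."""
    return pop_sd / math.sqrt(n)


def simulated_sd_of_means(pop_sd: float, n: int, n_samples: int = 5000,
                          seed: int = 0) -> float:
    """SD of the means of many random samples of n cases each.

    The population is taken as normal with mean 50 (an assumption).
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(50.0, pop_sd) for _ in range(n)])
        for _ in range(n_samples)
    ]
    return statistics.pstdev(means)


for n in (1, 2, 4, 16, 25):
    print(n,
          round(standard_error(POP_SD, n), 2),
          round(simulated_sd_of_means(POP_SD, n), 2))
```

The second and third printed columns should agree closely; the simulated values fluctuate a little because only a finite number of samples is drawn.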

Estimating the Standard Error of a Mean from Known Statistics. Formula
(9.1) requires our knowing the parameter $\bar{\sigma}$ in order to compute the
standard error of a mean. In ordinary practice we must be satisfied with 
an estimate of this standard error. Two ways for making this estimate will 
be described. 

Estimation of $\sigma_M$ from $\sigma$. In describing a sample, we usually compute $\sigma$
as well as the mean. When $\sigma$ is known, we may estimate $\bar{\sigma}_M$ by
the formula

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \qquad \text{(Standard error of a mean estimated from } \sigma\text{)} \tag{9.2}$$

The reason for the expression N − 1 in this formula can be better understood
after we consider the next estimation method. Some writers recommend
that for large samples (N of 30 or above) we simply substitute $\sigma$ for $\bar{\sigma}$
in formula (9.1), in which case we should have the ratio $\sigma/\sqrt{N}$ instead of
the ratio $\sigma/\sqrt{N - 1}$. This overlooks the fact that $\sigma$ is actually a biased
estimate of $\bar{\sigma}$ for samples of any size; the smaller the sample, the greater
the bias. There is no sudden change in this condition at an N of 30. The
result of using formula (9.2) is identical with that from the next procedure,
which is favored by statisticians.

Estimation of $\bar{\sigma}$ from a Sample. The standard deviation in a sample is
likely to be smaller than that for the population from which the sample
came. Recall from the discussion in Chap. 5 that as samples become smaller
the total range of measures is more and more curtailed. This comes about
from the fact that extreme deviations in the population are rare and in small
samples are likely to be missed. This fact also has an effect upon the standard
deviation, though to a smaller extent. In the smaller samples, particularly,
$\sigma$ gives an estimate of the population $\bar{\sigma}$ that is biased downward.

A less biased estimate of $\bar{\sigma}$ is given by the formula

$$s = \sqrt{\frac{\Sigma x^2}{N - 1}} \qquad \text{(Best estimate of population standard deviation)} \tag{9.3}$$

where $\Sigma x^2$ = sum of squares in the sample and N = number of cases in the
sample. Statisticians say that $s^2$ is an unbiased estimate of the population
variance $\bar{\sigma}^2$ but that s involves a little bias as an estimate of the population
standard deviation $\bar{\sigma}$. The reasons for this are rather involved and need
not concern us here. In any case, the bias in s is smaller than that in $\sigma$,
when they are used as estimates of $\bar{\sigma}$.
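
The downward bias of the sample variance with divisor N, and its removal by the divisor N − 1, can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the population is taken as normal with mean 0 and standard deviation 10, and the function names are hypothetical.

```python
import random
import statistics


def variance_divisor_n(values):
    """Variance of the sample itself: sum of squares divided by N."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def variance_divisor_n_minus_1(values):
    """Square of s in formula (9.3): sum of squares divided by N - 1."""
    m = statistics.fmean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def average_variances(n, pop_sd=10.0, n_samples=10000, seed=1):
    """Average both variance estimates over many random samples of n cases."""
    rng = random.Random(seed)
    total_n, total_n1 = 0.0, 0.0
    for _ in range(n_samples):
        sample = [rng.gauss(0.0, pop_sd) for _ in range(n)]
        total_n += variance_divisor_n(sample)
        total_n1 += variance_divisor_n_minus_1(sample)
    return total_n / n_samples, total_n1 / n_samples


# The population variance is 10 squared, or 100.
for n in (5, 30):
    biased, unbiased = average_variances(n)
    print(n, round(biased, 1), round(unbiased, 1))
```

With samples of five cases the divisor-N average should fall near 80 rather than 100, while the divisor-(N − 1) average should fall near 100; with samples of 30 the two averages lie much closer together, though the first is still the smaller.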

Degrees of Freedom. Formula (9.3) contains an important new concept 
that will be found liberally utilized hereafter when sampling errors (deviations
of statistics from parameters) are mentioned, particularly in connection
with small samples. 

Compare formula (9.3) with the basic one for the standard deviation of a 
sample [formula (5.5)] and it will be found that they are identical except 
for the denominators, which are (N − 1) and N, respectively. The difference
between the two may seem very slight (and it is slight numerically
when N is reasonably large), but there is a very important difference in
meaning. In this particular formula, (N − 1) is known as the number of
degrees of freedom, which is symbolized by df. This has been a key concept in
recent years in what has been known as small-sample statistics. The number
of degrees of freedom will not always be (N − 1) but will vary from one
statistic to another, as will be pointed out in various places later. Let us
see why the number is (N − 1) here.

The “freedom” part of the concept means freedom to vary. The standard 
deviation is computed from the variance, and the variance is computed 
from deviations from the mean. Statisticians often express the matter 
by saying¹ that 1 degree of freedom is “used up” when we compute the mean
of a sample. This leaves (N − 1) degrees of freedom for estimating the
population variance and the standard deviation. 

A numerical example will make this clearer. Let us assume five measurements:
5, 7, 10, 12, and 16, the mean of which is 10.0. A mathematical
requirement or property of the arithmetic mean is that the sum of the
deviations from it equals zero. The five deviations in this sample are −5, −3,
0, +2, and +6, the sum of which is zero. With this condition satisfied,
i.e., the sum equal to zero, how many of these deviations could be
simultaneously altered (as if by taking new samplings) and still leave the sum
equal to zero? With a little thought or trial and error it will be seen that if
any four are arbitrarily changed, the fifth is thereby fixed. We could make
the first four −8, −4, +1, and −2, which would mean that for the sum to
equal zero the fifth has to be +13. Try any other changes and if the sum
is to remain zero one of the five deviations is automatically determined.
Thus only four (N − 1) are “free to vary” within the restriction imposed.
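
The restriction in this example can be put in a few lines of code. The sketch below uses the five measurements of the text; the helper name `last_deviation` is hypothetical and serves only the illustration.

```python
def last_deviation(free_deviations):
    """Given N - 1 freely chosen deviations, return the one remaining value
    that is forced if all N deviations are to sum to zero."""
    return -sum(free_deviations)


scores = [5, 7, 10, 12, 16]
mean = sum(scores) / len(scores)
deviations = [x - mean for x in scores]

print(mean)                                # mean of the five measurements
print(deviations, sum(deviations))         # deviations and their sum
print(last_deviation([-8, -4, +1, -2]))    # the fifth deviation is fixed
```

Whatever four deviations are supplied, the function returns the only fifth value consistent with a fixed mean, which is the sense in which only N − 1 deviations are free to vary.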

The restriction is that the mean is taken as fixed for the sample. In this
sense, the computation of the mean “uses up” 1 degree of freedom. There
were N degrees of freedom in computing the mean because the cases were
presumably sampled entirely independently. If they were not independently
sampled, then there were also fewer than N degrees of freedom in computing
the mean. We shall see examples of this later. Freedom means independence,
and only when there is independence of observations can the
“laws of chance” operate freely and the mathematics based upon the “laws
of chance” be applied.


¹ For an excellent discussion of the general subject of degrees of freedom, see Walker,
H. M. Degrees of freedom. J. educ. Psychol., 1940, 31, 253-260.

The SE of a Mean Directly from a Sum of Squares. Whether we precede
the estimation of the standard error of a mean by computing $\sigma$ or s from the
sample, we find ourselves performing the same steps, but in a different order.
These steps are dividing by (N − 1) and by N. If we should happen to
have no interest in knowing the value of either $\sigma$ or s, we can combine these
two operations in a single equation, and we have the formula

$$\sigma_M = \sqrt{\frac{\Sigma x^2}{N(N - 1)}} \qquad \text{(Standard error of a mean estimated directly from a sum of squares)} \tag{9.4}$$
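
That the three routes lead to the same standard error can be verified numerically. The sketch below applies formula (9.2), the ratio of s to the square root of N, and formula (9.4) to the five measurements used in the degrees-of-freedom example; the function names are hypothetical.

```python
import math


def sum_of_squares(values):
    """Sum of squared deviations from the sample mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def se_from_sigma(values):
    """Formula (9.2): sample SD (divisor N) over the square root of N - 1."""
    n = len(values)
    sigma = math.sqrt(sum_of_squares(values) / n)
    return sigma / math.sqrt(n - 1)


def se_from_s(values):
    """Estimate s of formula (9.3) over the square root of N."""
    n = len(values)
    s = math.sqrt(sum_of_squares(values) / (n - 1))
    return s / math.sqrt(n)


def se_from_sum_of_squares(values):
    """Formula (9.4): directly from the sum of squares."""
    n = len(values)
    return math.sqrt(sum_of_squares(values) / (n * (n - 1)))


scores = [5, 7, 10, 12, 16]
print(se_from_sigma(scores), se_from_s(scores), se_from_sum_of_squares(scores))
```

The three printed values agree apart from floating-point rounding, since each divides the same sum of squares once by N and once by N − 1.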


Interpretation of a Standard Error of a Mean. We are now ready to 
apply the standard-error formula to a concrete instance and to consider 
the interpretation of the obtained SE. To revive an old illustration, the 
ink-blot data, we find that $\sigma$ is 10.45 and N is 50. Applying formula (9.2),
$\sigma_M = 10.45/\sqrt{49} = 1.49$. For simplicity in discussion, let us round this
to 1.5. 

What we are asking when we estimate this standard error is, “How far 
from the population mean are the sample means like this one we obtained 
likely to vary?” We do not know what the population mean is, but from 
the value 1.5 we conclude that means of samples of 50 cases each would not 
deviate from it in either direction more than 1.5 units about two-thirds of 
the time. We may conclude this because in a sample as large as 50 we may 
assume that the sample means are normally distributed. This assumption 
makes possible a number of inferences that we could not make without it. 

Since, as we have already seen, in this situation of the ink-blot data we 
may conclude that two-thirds of the sample means (when N is 50) will lie
within 1.5 units, plus or minus, from the population mean, we can also 
say that there is only 1 chance in 3 for a sample mean to be further than 
1.5 units from the population mean. Or we can say that the odds are 2 to 
1 that sample means will be within a range of three units, the middle of 
which is the population mean. The standard error thus brackets a range 
within which to expect sample means. We shall expand this idea in the 
discussion to follow. 
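
The "two-thirds" and "2 to 1" figures follow from the normal curve. A minimal sketch, assuming normally distributed sample means as the text does, computes the proportion of a normal distribution within one standard error of its center using Python's standard `statistics.NormalDist`; the function name is hypothetical.

```python
from statistics import NormalDist


def proportion_within(k: float) -> float:
    """Proportion of a normal distribution lying within k standard
    deviations on either side of its mean."""
    unit = NormalDist()  # mean 0, standard deviation 1
    return unit.cdf(k) - unit.cdf(-k)


inside = proportion_within(1.0)
odds = inside / (1.0 - inside)
print(round(inside, 4), round(odds, 2))
```

The proportion is a little more than two-thirds, so the odds are slightly better than 2 to 1; the text's round figures are convenient approximations.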

Hypotheses concerning the Population Mean. The kind of conclusion that 
we should most like to make is slightly different from the one just given. 
We are attempting to estimate the population mean, knowing the sample 
mean. We should therefore like to know how far away from the sample 
mean the population mean is likely to be. 

It might seem that, if we can say that two-thirds of the sample means 
are within one SE of the population mean, we could also say that the odds 
are 2 to 1 that the population mean is within one SE of the sample mean. 
But note that the last statement implies a normal distribution about the 
sample mean, whereas, actually, the sampling distribution is about the 
population mean. In all logical strictness, we cannot reverse the roles of M
and $\bar{M}$ in this manner. But through some mathematical reasoning, which
we can explain only briefly here, we can do something equivalent. The
process results in setting up confidence intervals and confidence limits for the
population mean.

Since we do not know the population mean, we are free to make some
guesses, or hypotheses, about its value. No matter what reasonable hypothesis
we choose, the estimated standard error still applies to the distribution of
expected sample means about this hypothetical value.

In the ink-blot problem, the sample mean was 29.6. Let us select in
turn a number of hypothetical population means. They should, of course,
be somewhere in the neighborhood of the sample mean. Figure 9.3 shows
five normal sampling distributions, each about a different hypothesized $\bar{M}$
and each with a standard deviation (SE) of 1.5. The hypothesized means
are all above 29.6; they could just as well have been chosen below that value.
They are at the values 30.0, 31.0, 32.0, 33.0, and 34.0.


[Fig. 9.3 axis: hypothetical population means at 30.0, 31.0, 32.0, 33.0, and 34.0.]

Fig. 9.3. Hypothetical sampling distributions corresponding to various hypotheses concerning
the population mean when the obtained sample mean is 29.6.

Consider, first, the hypothesis that is farthest from the sample mean,
namely, a hypothetical $\bar{M}$ of 34.0. A sample mean of 29.6 deviates 4.4
score units from this hypothetical $\bar{M}$. This deviation gives a z (standard-score
value) of 4.4/1.49 = 2.95 (using the unrounded SE of 1.49). Since we are
dealing here with a sampling
distribution, whose elements are means, not single observations, and whose
constants are estimated parameters, let us use the symbol $\bar{z}$ to indicate such
a standard-score value. We may enter the normal-curve tables with such a
value as we would for an ordinary z.

We next ask what is the probability of a deviation as large as this occurring
by random sampling. This probability is twice the proportion of the area
under the tail of the normal curve beyond the point at $\bar{z}$ = 2.95. When we
say “a deviation as large as this,” we actually mean a deviation as large or
larger. This would include all sample means of 29.6 and lower. Since by
chance it is just as easy to obtain deviations in the opposite direction (remember
that the normal distribution is symmetrical), we need also to include
in our consideration the area in the other tail. This would include all
sample means deviating to 38.4 (34 + 4.4) or more. Table B (Appendix B)
indicates that with a $\bar{z}$ of 2.95 the area in one tail is .0016. Doubling this,
we have .0032. We can conclude that, if the population mean were 34.0,
there is only the slim chance of 32 in 10,000 for a mean of such extreme value
as 29.6 to occur by random sampling. Since these odds are so small, we
reject with much confidence the hypothesis that the population mean is 34.0.

The next hypothesis is for $\bar{M}$ equal to 33.0, which gives a deviation of 3.4
and a $\bar{z}$ of 2.28. The area under the normal curve beyond this point is
.0113. Twice this area is .0226. If the population mean were actually 
33.0, there are only about 2 chances in 100 for a departure such as a sample 
mean of 29.6 to occur. If we reject this hypothesis, there are only 2 chances 
in 100 that we would be wrong. Although we could not reject this hypothesis 
with as much confidence as we could the previous hypothesis, we could still 
do so with a high level of assurance. 

If we hypothesize an $\bar{M}$ of 32.0, the deviation is 2.4, $\bar{z}$ is 1.61, and the tail
area (doubled) is .1074. The chances for a random deviation this large
are more than 10 in 100. If we hypothesize that $\bar{M}$ = 31.0, the deviation
is 1.4, $\bar{z}$ is 0.94, and the probability for so large a deviation is .348. We could
not very well reject the hypothesis that the population mean is 31.0. There 
would be considerable risk in the decision to do so. We can say that this 
hypothesis is rather plausible. 

But other hypotheses are even more plausible. If we choose the hypothesis
that $\bar{M}$ = 30.0, the deviation is 0.4, $\bar{z}$ is 0.267, and the area beyond this
deviation is .788 of the total. Thus, as we approach the sample mean closer
and closer with our hypothetical population mean, the odds in favor of 
greater deviations than the obtained one keep increasing. The hypothesis 
becomes more and more plausible. The maximum plausibility would be 
reached when the hypothesis is 29.6, in other words, when it coincides with 
the sample mean. From this point of view, we can say that the sample 
mean (when other information is lacking) is the most defensible estimate 
of the population mean. It is an unbiased estimate, since the deviations 
are as likely to be positive as negative. 
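
The successive hypotheses just examined can be run through in one pass. The following sketch, offered only as an illustration, uses the unrounded standard error from formula (9.2) and the normal curve in place of Table B; the function name and the use of `statistics.NormalDist` are assumptions of the example, not part of the original procedure.

```python
import math
from statistics import NormalDist

SAMPLE_MEAN = 29.6
SE = 10.45 / math.sqrt(49)  # formula (9.2) for the ink-blot data, about 1.49


def two_tailed_probability(hypothesized_mean, sample_mean=SAMPLE_MEAN, se=SE):
    """Return the standard-score value of the sample mean under the
    hypothesis, and the probability of a deviation as large or larger
    in either direction."""
    z_bar = abs(sample_mean - hypothesized_mean) / se
    one_tail = 1.0 - NormalDist().cdf(z_bar)
    return z_bar, 2.0 * one_tail


for hypothesis in (34.0, 33.0, 32.0, 31.0, 30.0):
    z_bar, p = two_tailed_probability(hypothesis)
    print(hypothesis, round(z_bar, 2), round(p, 4))
```

The printed probabilities rise steadily as the hypothesized mean approaches 29.6. They may differ from the tabled values in the third decimal place, since the table is entered with a rounded standard score.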

Confidence Limits and Confidence Intervals. From this discussion the 
general picture is that of a sliding scale of confidence with respect to 
the location of the population mean. Possible values more remote from the 
sample mean can be rejected with much confidence; values nearer to the 
sample mean can be rejected with less and less confidence as we approach 
the sample mean. It is not customary to go through the kinds of steps we 
have just seen in order to interpret a mean and its standard error. By 
common consent an arbitrary choice has been taken to adopt two particular 
levels of confidence. One is known as the 5 per cent level, or .05 level, and 
the other as the 1 per cent level, or .01 level. 

At the .05 level is a deviation that leaves 5 per cent of the area in the two
tails of the normal distribution, 2.5 per cent in each tail. This area at
either end is marked off at a $\bar{z}$ value of plus or minus 1.96. The .01 level
leaves 1 per cent of the area in the two tails, .5 of 1 per cent in either tail.
The $\bar{z}$ that marks off this much area at either end is 2.58. These percentages
and these values are applied regardless of the size of the mean or of its
standard error. It must be remembered, however, that they apply only
to large samples.

For the ink-blot problem, a $\bar{z}$ of 1.96 corresponds to a score deviation of
2.9 (which is 1.96 times $\sigma_M$). All hypotheses of population means differing
more than 2.9 from the sample mean can be rejected at the .05 level. Only
once in 20 times would we be in error by making this decision. (This once
would be when the deviation is really due to chance.) Since these confidence
limits are 2.9 units from the sample mean, they come at score values
of 29.6 − 2.9 and 29.6 + 2.9, or at 26.7 and 32.5, respectively. The score
limits of 26.7 and 32.5 mark off a confidence interval within which the population
mean probably lies. The probability to be associated with this interval
is .95 (i.e., 1.00 − .05).

We can make a similar interpretation in connection with the .01 level.
All hypothetical means differing more than 3.9 (3.9 is 2.58 times $\sigma_M$) from
the sample mean can be rejected, with only 1 chance in 100 of being wrong
in doing so. The confidence interval is from 25.7 to 33.5, and the probability
to be associated with it is .99. We have a high degree of assurance that the
population mean is between 25 and 34. The odds are 99 to 1 in favor of
this conclusion. Whether we wish to stake our case on the .05 limits or the
.01 limits depends upon our inclinations. In the next chapter we shall find
much more discussion on the choice of standards of confidence.¹
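
The confidence limits for the ink-blot mean can be reproduced as follows. This is a sketch under the large-sample assumption of the text; the function name is hypothetical, and the critical values are taken from `statistics.NormalDist().inv_cdf` rather than from the rounded figures 1.96 and 2.58.

```python
import math
from statistics import NormalDist


def confidence_limits(sample_mean, se, level):
    """Lower and upper confidence limits for the population mean.

    `level` is the total area left in the two tails, e.g. .05 or .01.
    """
    critical = NormalDist().inv_cdf(1.0 - level / 2.0)
    margin = critical * se
    return sample_mean - margin, sample_mean + margin


se = 10.45 / math.sqrt(49)  # formula (9.2) for the ink-blot data
for level in (0.05, 0.01):
    low, high = confidence_limits(29.6, se, level)
    print(level, round(low, 2), round(high, 2))
```

The .05 limits round to 26.7 and 32.5, as in the text. The .01 limits come out a few hundredths inside 25.7 and 33.5, because the text rounds the margin up to 3.9 before adding and subtracting.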

Comparisons of Some Obtained Means and Standard Errors. Let us apply
the interpretation of $\sigma_M$ to some other data. The practical usefulness of a
statistic is often more apparent when comparing the same statistic derived
from different data. In Table 9.1 are given means of Army General Classification
Test scores for samples derived from different civilian occupational
groups. For the sake of an illustration, we will assume that each occupational
group represents a different population, as designated, and that the
[...]. What do the standard errors in this table tell us?

The mean in which we would have the greatest confidence, as representing
the status of the general occupational population, is that for the
truck driver. The odds are about 2 to 1 that this sample mean of 96.2
does not deviate more than .7 from the mean of all truck drivers that this
sample represents. We could be practically certain (allowing a margin
of ±3$\sigma_M$) that the obtained mean for truck drivers is not over two units
distant from that of all truck drivers of this kind. The mean in which we
have least confidence is that for teamsters by reason of its $\sigma_M$ of 2.23.


¹ Other confidence levels sometimes used are the 10 per cent, or .10, level (when $\bar{z}$ is 1.65);
the .02 level (when $\bar{z}$ is 2.33); and the .005 level (when $\bar{z}$ is 2.81).

TABLE 9.1. COMPARISON OF MEANS OF SCORES ON THE ARMY GENERAL CLASSIFICATION
TEST AS APPLIED TO MEN FROM DIFFERENT CIVILIAN OCCUPATIONAL CATEGORIES*

Occupation      $\sigma_M$
Accountant      0.88
Lawyer          1.13
Reporter        1.76
Sales clerk     0.74
Plumber         1.42
Truck driver    0.69
Farm hand       0.72
Teamster        2.23

[The N, M, and $\sigma$ columns of this table are not recoverable.]

* From Harrell, T. W., and Harrell, M. E. Army General Classification Test scores for civilian
occupations. Educ. psychol. Measmt., 1945, 8, 229-240. By permission of the publisher.

Incidentally, the relation of $\sigma_M$ to both $\sigma$ and N can be seen roughly by
comparison of the data for the occupational groups. On the whole, the
largest standard errors come for samples where N is smallest (for lawyer,
reporter, plumber, and teamster), though the rank orders are not perfect
within this list of four. Where sample sizes are comparable, as for lawyer
and teamster, and for accountant and plumber, the value for $\sigma_M$ is more
apparently in proportion to the standard deviation of the sample. It can
be seen that, if the sample is large enough, the standard error can be brought 
below one scale unit. 

Some Special Problems concerning the SE of a Mean. We shall now 
consider several conditions that have a bearing upon the standard error 
and the steps that may be taken to deal with them. 

When Sampling Is Not Random. It has been repeatedly stressed that 
sampling statistics, including standard errors, apply only when sampling 
has been random. The reason for this is that the mathematics of the 
situation are exact only when sampling has been random. Any condition 
that tends to interfere with randomness of selection of observations, there- 
fore, will make the estimation of standard errors and their application in 
drawing conclusions inaccurate, if not misleading. There are several note- 
worthy situations that depart from the random requirement. Some would 
lead to standard errors that are too small to describe the actual distributions 
of means, and others would lead to standard errors that are too large. In
the former case, we should have too much confidence in the accuracy of
the mean, and in the latter case we should have too little. There have been
developed certain variations in the standard-error formulas to take care of 
some of the special situations. 

Samples with Bias. The effect of biased sampling upon the distribution
of means can be strikingly illustrated by reference to some data on the
training of pilots in the AAF during World War II. All pilot students were
given a battery of classification tests from which was derived for each man
a “pilot stanine,” or composite pilot-aptitude score. Every month at the
completion of preflight training, students were formed into class groups,
each sent to a different primary flying school. In one study which covered
a six-month period, 269 such classes had been sent to 58 training schools
divided among three AAF Flying Training Commands. The mean stanine
for approximately 52,000 students was 5.56. This value may be taken as
the population mean in this situation. The standard deviation of the population
was assumed to be 1.96. The average size of sample (each class group
in a single school) was 195.¹ From this information, using formula (9.1),
we compute a standard error of 0.14. From this we should expect two-thirds
of the 269 mean stanines to deviate not more than 0.14 from 5.56, if
the sampling had been random. What are the facts?

When the 269 means were actually compiled in a frequency distribution
and their standard deviation computed, the dispersion of means was actually
found to be very much larger than was expected (see Table 9.2). Where
one would expect a range of means within the limits 5.2 to 6.0, the actual
range was from 4.6 to 6.9. Where the expected standard deviation of the
distribution of means was 0.14, the actual standard deviation was 0.37.
A comparison of the expected and obtained distribution of means is shown
in Fig. 9.4.

TABLE 9.2. SAMPLING STATISTICS CONCERNING 269 CLASS GROUPS OF PILOTS IN
PRIMARY TRAINING DURING A PERIOD OF SIX MONTHS IN THREE TRAINING
COMMANDS OF THE AAF DURING WORLD WAR II*

Variable                  Obtained range
Pilot stanine             4.6-6.9
Graduation rate           40-90
Validity coefficient      0.21-0.71

[The remaining columns of expected and obtained results are not recoverable.]

* Including the pilot stanine, or composite pilot-aptitude score; the graduation rate, or percentage
of a class graduating; and validity coefficient, a biserial coefficient of correlation between stanine and
graduation versus elimination.
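
The expected standard error for the class means, and the extent to which the obtained dispersion exceeds it, can be computed directly from the figures quoted in the text. The sketch below is only a worked check; the variable names are hypothetical.

```python
import math

POP_SD = 1.96                 # assumed population SD of the pilot stanine
CLASS_SIZE = 195              # average number of students per class group
OBTAINED_SD_OF_MEANS = 0.37   # SD actually found among the 269 class means

# Formula (9.1): expected SD of class means under random sampling.
expected_se = POP_SD / math.sqrt(CLASS_SIZE)

# How many times wider the obtained distribution is than the expected one.
ratio = OBTAINED_SD_OF_MEANS / expected_se

print(round(expected_se, 2), round(ratio, 1))
```

An obtained dispersion more than two and a half times the expected one is far too large to attribute to random sampling, which is the point of the comparison.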

The obvious conclusion is that the sampling of aviation students in pilot
classes was most probably not random. One can surmise some of the causes
after looking into the procedures by which class groups were formed. In
each preflight class (i.e., each month) a small percentage of students would
fail to pass the curriculum successfully and would be held over, probably
to qualify for flight training in the next class. There was a tendency for
the “holdovers” to be sent together to the same flight schools. They
tended to be of low pilot aptitude. There may have been some geographical
differences in pilot aptitude which would tend to make the averages of
stanines differ systematically somewhat from one Command to another.
This hypothesis could be subjected to experimental check by comparing
Command averages. There were probably other reasons for students of
similar aptitudes to gravitate together, hence the biasing of samples.

¹ Actually, some classes deviated from 195 in number. For the sake of an illustration,
however, we may treat the samples as if they were of constant size.

[Fig. 9.4 panels: expected and obtained ($\sigma$ = 0.37) distributions of mean pilot
stanines, frequency against the pilot stanine (aptitude score) scale; expected and
obtained distributions of validity coefficients, frequency against the validity
(correlation) coefficient scale.]

Fig. 9.4. Distribution of expected and obtained sample means, also of expected and obtained
validity coefficients, in connection with 269 samples (class groups) of AAF pilots in primary
training during a five-month period in about 60 different schools. Especially to be noted
is that the obtained distribution of means was much wider than expected, indicating
nonrandom sampling, while the distribution of validity coefficients was about as expected,
indicating random sampling. This is possible because two different kinds of sampling are
involved.

Another study was made of the graduation rates (percentage of a class
group graduating) in different samples. The pertinent data are given in
Table 9.2. From the over-all graduation rate of 65.3 and the size of sample,
we should expect [by formula (9.10)] the standard deviation of the distribution
of the 269 rates to be 3.4. Actually it was 9.5. Since the probability of
graduation for any cadet was strongly correlated with his aptitude score,
we should expect the bias in sampling on aptitude to be reflected in biased
samples as to graduation rate. This is probably not the whole story, however.
There were many other conditions which could contribute to marked
variations in graduation rate besides the variations in aptitude. Weather
conditions varied from school to school and from month to month. Training
practices and policies may have varied, in spite of close regulation. Instructor
and test-pilot judgments were not standardized hurdles and may have
varied from school to school.
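
The expected figure of 3.4 for graduation rates can be checked in the same way. Formula (9.10) is not reproduced in this section; the sketch below assumes it is the usual standard error of a percentage, the square root of PQ/N with Q = 100 − P, and the function name is hypothetical.

```python
import math


def se_of_percentage(p_percent: float, n: int) -> float:
    """Standard error of a percentage: sqrt(P * Q / N), with Q = 100 - P.

    Assumed here to correspond to formula (9.10), which is given elsewhere.
    """
    q_percent = 100.0 - p_percent
    return math.sqrt(p_percent * q_percent / n)


expected = se_of_percentage(65.3, 195)
obtained = 9.5
print(round(expected, 1), round(obtained / expected, 1))
```

Under this assumption the expected value agrees with the 3.4 of the text, and the obtained standard deviation is nearly three times as large.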

A third study is mentioned now for comparison, although it involves the 
sampling errors of coefficients of correlation which are treated later. This 
study is concerned with the variation in validity coefficients in the same 269 
class groups. The validity of the pilot stanine for predicting the training 
success of pilots was indicated by what is known as the biserial coefficient 
of correlation (see Chap. 13). This has approximately the same value as a 
Pearson product-moment r but is computed when one of the variables, 
assumed to be normally distributed actually, is forced into two categories. 
The two categories for the training criterion were the graduates and the 
eliminees. The standard error for a biserial correlation equal to .53 when
the size of sample is 195 amounts to .073 [computed by formula (13.8)].
The expected and obtained statistics are given in Table 9.2 and illustrated
in Fig. 9.4. In drawing the distribution curve, normal distribution of the
coefficients was assumed, whereas the expected distribution should be slightly
negatively skewed. The obtained distribution of the 269 coefficients was
actually so skewed. At any rate, since the obtained standard deviation 
was only .088 and not so very different from the expected one (.073), we may 
conclude that if there was biased sampling with respect to the validity of 
pilot stanines it was of minor importance. While there were seemingly 
enormous variations in validity from school to school and from time to time, 
amounting to a spread from .21 to .71, those variations may be regarded as 
due mostly to sampling errors. Incidentally, this example shows just how 
much obtained correlation coefficients may deviate from the population 
parameter even with samples as large as 195. Any single obtained coefficient 
may be anywhere in the range of such a distribution, but the saving feature 
is that extreme deviations are highly improbable and small ones most 
probable. These illustrations should demonstrate more clearly some of 
the practical uses of standard errors, as well as the importance of random 
sampling if we are going to make accurate and useful interpretations. 

When Observations Are Not Independent. Random sampling also implies 
independence of observations. In the preceding examples, observations
were not independent because certain restricting conditions tied cases
together; if one student was chosen to go to a certain school at a certain 
time, one or more others like him were also chosen with him. There are 
other situations where this occurs, many times without the investigator’s 
being aware of it. It is most likely to occur when sampling is obtained 
from subgroups of the population. 

Suppose that we have an experiment in which there are 10 subjects and
each has 10 trials in each experimental session. For each session we do not
have 100 independent observations. Nor do we have merely 10 observations.
Because there are individual differences, the 10 observations in
each set will be somewhat homogeneous, having been derived from a single
source. In the larger setting of the 100 observations, they are not independent.
In computing $\sigma_M$ for these 100 observations, the number of
degrees of freedom is not 99. It is difficult to say just what the df should be.
The most conservative approach would be to assume 10 observations, each
being the mean derived from one individual, and 9 degrees of freedom. But
this would lead to an overestimate of the standard error. In the situation
described, we have what is called cluster sampling. For special treatments
of this subject that include formulas for estimating $\sigma_M$, the reader is referred
to discussions by Marks and by Jarrett and Henry.¹

The Reliability of a Median. The variability of sample medians is about 
25 per cent greater than the variability of means when the population is 
normally distributed. Under this condition the standard error of a median 
can be estimated by the formula 

$$\sigma_{Mdn} = \frac{1.253\,\sigma}{\sqrt{N}} \quad \text{(Standard error of a median estimated from } \sigma\text{)} \tag{9.5}$$


As applied to the ink-blot-test data, 


$$\sigma_{Mdn} = \frac{1.253 \times 10.45}{\sqrt{50}} = 1.85$$


Two-thirds of the sample medians of ink-blot scores, when N equals 50, 
in samples drawn at random from the population will be expected within 
1.85 units of the population median. Since the population is normally 
distributed, by assumption, we may also say that the sample medians would 
not deviate from the population mean more than 1.85 units, two-thirds of 
the time. The median may thus be used as an estimate of the population 
mean, but with less confidence than we have in the use of the sample mean 
for the same purpose. 
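
The two ink-blot figures quoted so far can be checked with a short Python sketch. It assumes, as the text's figure of 1.49 for the SE of the mean implies, that formula (9.2) divides by the square root of N − 1; the function names are illustrative and are not part of the text.

```python
import math


def se_mean(sigma, n):
    # Assumed form of formula (9.2): sigma / sqrt(N - 1).
    return sigma / math.sqrt(n - 1)


def se_median(sigma, n):
    # Formula (9.5): valid only when the population is normally distributed.
    return 1.253 * sigma / math.sqrt(n)


sigma, n = 10.45, 50                   # ink-blot data as quoted in the text
print(round(se_median(sigma, n), 2))   # 1.85
print(round(se_mean(sigma, n), 2))     # 1.49
```

The median's standard error comes out roughly one-fourth larger than that of the mean, which is the sense in which the median is the less dependable estimate of the population mean.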


Marks, E. S. Sampling in the revision of the Stanford Binet Scale. Psychol. Bull., 
1947, 44, 413-434; Jarrett, R. F., and Henry, F. M. The relative influence on error of 
replicating measurements or individuals. J. Psychol., 1951, 81, 175-180. 

THE RELIABILITY OF OTHER STATISTICS 


The Standard Error of a Standard Deviation. The standard deviation 
will also fluctuate from sample to sample. For a given size of sample, 
the sampling distribution of $\sigma$ is somewhat skewed for small samples but 
approaches the normal form so closely for large samples that we can draw 
inferences about a sample $\sigma$, knowing its standard error. This SE is estimated
by the formula 


$$\sigma_\sigma = \frac{\sigma}{\sqrt{2N}} \quad \text{(Standard error of a standard deviation)} \tag{9.6}$$


Applied to the ink-blot data, 


$$\sigma_\sigma = \frac{10.45}{\sqrt{100}} = 1.045$$


We can now say that the odds are 2 to 1 that the sample $\sigma$ will not deviate 
more than one unit (1.045 is rounded to 1) from the population $\sigma$. 

Comparing formula (9.6) with formula (9.2) for the SE of a mean, we 
can see that a population standard deviation is more accurately estimated 
than is a population mean, when we compare them as to sampling errors. 
The denominators of these two formulas contain the values 2N and (N − 1), 
respectively, which means that $\sigma_M$ is more than 40 per cent greater than 
$\sigma_\sigma$. For the ink-blot data, the two standard errors are 1.045 and 1.49, 
respectively. In one sense it is fortunate that the standard deviation is 
more stable than the mean, because both $\sigma_M$ and $\sigma_\sigma$ are estimated from it. 
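
The "more than 40 per cent" figure can be made explicit with one line of algebra. The sketch below assumes that formula (9.2), which is not reproduced in this section, has the form $\sigma_M = \sigma/\sqrt{N-1}$, as the denominators just mentioned suggest.

$$\frac{\sigma_M}{\sigma_\sigma} = \frac{\sigma/\sqrt{N-1}}{\sigma/\sqrt{2N}} = \sqrt{\frac{2N}{N-1}} > \sqrt{2} \approx 1.414$$

For the ink-blot data, with N = 50, the ratio is $\sqrt{100/49} = 10/7 \approx 1.43$, which agrees with 1.49/1.045.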

The Standard Error of Q. The SE of the semi-interquartile range is 
estimated by the formula 

$$\sigma_Q = \frac{.7867\,\sigma}{\sqrt{N}} \quad \text{(Standard error of } Q \text{ estimated from } \sigma\text{)} \tag{9.7}$$

when the population distribution is normal. For the ink-blot data $\sigma_Q = 1.16$. 

If the standard deviation is not known but Q is known, and the distribution 
is normal, the next best procedure is to use the formula 

$$\sigma_Q = \frac{1.17\,Q}{\sqrt{N}} \quad \text{(SE of } Q \text{ estimated from } Q\text{)} \tag{9.8}$$

This substitute formula is possible because in a normal distribution 
$Q = .6745\sigma$. 

Applying this formula to the ink-blot data, we find that $\sigma_Q = 1.24$. The 
slight discrepancy between the two estimates of $\sigma_Q$ just obtained may be 
due to the fact that the distribution is not normal or to minor irregularities 
in frequencies in class intervals that are crucial to the estimation of Q. This 
suggests the need for very large samples and regular ones, in applying formula 
(9.8). 

The Reliability of a Proportion. Data in terms of frequencies, percentages, 
and proportions are so common in the social sciences that the problem of 
their reliability is very important. Out of a sample of 100 students quizzed 
at random, the proportion of them who reported the habit of reading a daily 
newspaper is .65. How well does this proportion represent the student 
population? Assuming that we have a random sample, there is a way of 
estimating how such a proportion of 100 observations might be expected to 
vary. The SE of a proportion measures this variation, and with a known or 
assumed form of sampling distribution we can arrive at conclusions as to 
the accuracy of the obtained result. 

The SE of a proportion is given by the formula 


$$\sigma_p = \sqrt{\frac{\bar{p}\,\bar{q}}{N}} \quad \text{(Computed SE of a proportion)} \tag{9.9a}$$

where $\bar{p}$ = proportion of the population who are in the category selected 
      $\bar{q}$ = proportion of the population not in the category ($\bar{q} = 1 - \bar{p}$) 
      N = number in the sample 

We ordinarily do not know the parameters $\bar{p}$ and $\bar{q}$. The practical solution
is to use the sample p and q as the best estimates we know for those 
parameters. 
The useful formula is therefore 

$$\sigma_p = \sqrt{\frac{pq}{N}} \quad \text{(Estimated SE of a proportion)} \tag{9.9b}$$


The outcome of formulas (9.9) depends relatively more upon the size of 
N than of p and q, because the product pq remains fairly uniform between 
.20 and .25 for quite a range of values of p (namely, for p between .27 and 
.73). If we have a better knowledge concerning the population $\bar{p}$, which is 
provided by other information, as from a larger sample or from a series of 
samples, we could use some other estimate of $\bar{p}$ as a hypothesis. One could 
choose some a priori estimate of $\bar{p}$ based upon logical reasoning. This 
approach will be given more attention in Chap. 10 on “Testing Hypotheses” 
and so will not be discussed further here. 

For the newspaper-reading data suggested above, where p is .65 and N 
is 100, the SE is estimated to be 

$$\sigma_p = \sqrt{\frac{(.65)(.35)}{100}} = \sqrt{.002275} = .048$$

The interpretation of this result, as usual, depends upon the form of the 
sampling distribution of p, which approaches the normal form if N is not 
too small and if $\bar{p}$ is not too close to .00 or 1.00. As $\bar{p}$ deviates from .5 in 
either direction, the distribution of p becomes skewed. A reason for this 
that can be readily seen is that no p can go below .00 or above 1.00.¹ Distributions
are curtailed at these extremes but can extend greater distances 
in the opposite directions. As samples become large, however, sampling 
distributions become so narrow that these terminal restrictions have less 
importance. 

As a practical rule for avoiding seriously nonnormal sampling distributions 
of p, it is recommended that we forgo estimating $\sigma_p$, or at least interpreting 
it, when the product Np (or Nq, whichever is smaller) is less than 10 (some 
writers say less than 5). With the lower limit of 10 for Np, if N is as small 
as 20, only one proportion could qualify to meet the rule, namely, p = .5. 
For small samples greater than 20 there is less restriction, but some. For 
example, if N = 40, only proportions between .25 and .75 could qualify 
for meeting normal-distribution standards under the rule. There are other 
methods for dealing with cases that do not come under this rule, as we shall 
see in Chap. 11. 

The obtained $\sigma_p$ in connection with the newspaper data is .048, or approximately
.05. Since the conditions for normal distribution of the sample 
proportions are satisfied, we can say that the odds are about 2 to 1 that the 
obtained proportion is not further than .05 from the population proportion. 
Our margin of error in the proportion of .65 may be stated as .05, using the 
$1\sigma$ limits. With probability of .95 the confidence interval extends from 
approximately .55 to approximately .75. With probability of .99 the confidence
interval extends from .52 to .78. The latter range is still all above 
.50, leaving us with considerable confidence that a majority of the students 
in this population do read newspapers. 
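
The newspaper example can be retraced in a few lines of Python. This is only a sketch: the function names are illustrative, the multipliers 1.96 and 2.58 are the usual normal-curve values for the .95 and .99 levels, and the intervals use the rounded SE of .05, as the text does.

```python
import math


def se_proportion(p, n):
    # Formula (9.9b): sample p and q stand in for the population values.
    return math.sqrt(p * (1 - p) / n)


def normal_rule_ok(p, n, minimum=10):
    # Practical rule from the text: the smaller of Np and Nq should reach 10.
    return min(n * p, n * (1 - p)) >= minimum


def confidence_interval(p, se, multiplier):
    # Symmetric limits, justified only when the rule above is satisfied.
    return p - multiplier * se, p + multiplier * se


p, n = 0.65, 100
se = se_proportion(p, n)
print(round(se, 3))                          # 0.048
print(normal_rule_ok(p, n))                  # True
print(confidence_interval(p, 0.05, 1.96))    # about .55 to .75
print(confidence_interval(p, 0.05, 2.58))    # about .52 to .78
```

With the unrounded SE of .048 the limits draw in slightly, but the lower .99 limit still stays above .50.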

The Proportion as a Mean. In connection with the question of reliability 
of a proportion, it is interesting to know that in one important sense the 
proportion is actually a mean and its standard error is actually the standard 
error of a mean. A numerical example will illustrate this point. 

Suppose we have administered a certain test item to 100 individuals, of 
whom 80 give the correct answer and 20 do not. Let each successful person 
receive a “score” of 1 and each unsuccessful person a “score” of 0. That is 
actually what we usually do in scoring a test composed of items. Each item 
may be regarded as a subtest on which the range of scores is usually 2 units. 
We need not confine this reasoning to responses to test items. Wherever 
events can be classified into a certain category or not, we can arbitrarily give 
a value of 1 to all cases in the category and a value of 0 to those not in the 
category. Other examples might be possessing a habit of reading a daily 
newspaper versus not having the habit; being an alcoholic versus not being 
an alcoholic; voting for candidate X versus not voting for candidate X; and 
so on. In terms of probability, the value of 1.0 stands for absolute certainty 
of an event's occurring and zero stands for absolute certainty of its not 
occurring. A proportion can thus be regarded as an average probability. 

¹ A more general, mathematical reason can be seen in connection with the discussion of 
binomial distributions in the next chapter. 

Returning to the test-item problem, the mean score for the 100 individuals 
is the sum of the scores divided by the number of them, in other words, 
$\Sigma X/N$ or $\Sigma fX/N$. The sum of the scores is 80 and N is 100, from which the 
mean is .8. This is also the proportion passing the item. Thus our proposi- 
tion that the proportion is a mean is demonstrated. 

To find the standard error of a mean, as by formula (9.2), we need to know 
the standard deviation of the sample. It can be shown that for a distribution 
in two categories the variance is equal to the product pq and the standard 
deviation is equal to $\sqrt{pq}$. This is demonstrated in Table 9.3. This table 
shows both the numerical solution for this particular illustrative problem and 
also the general solution in terms of symbols. From the table it should be 
clear that the variance equals pq and the standard deviation equals $\sqrt{pq}$. 
Using the latter as an estimate of the population standard deviation, by substitution
for $\sigma$ in formula (9.2) we have $\sqrt{pq}/\sqrt{N}$, or $\sqrt{pq/N}$, which is 
formula (9.9b) for the standard error of a proportion. Note that the use of 
$\sqrt{N}$ in this formula instead of $\sqrt{N - 1}$ indicates no loss of degrees of freedom
such as was true in computing $\sigma_M$. 


TABLE 9.3. COMPUTATION OF THE MEAN AND STANDARD DEVIATION FOR A 
DISTRIBUTION IN TWO CATEGORIES 

              Numerical solution            Solution with symbols
   X        f     fX      x     fx²         f       fX      x      fx²
   1       80     80    +.2     3.2         Np      Np      q      Npq²
   0       20      0    −.8    12.8         Nq      0      −p      Nqp²
   Sums   100     80           16.0         N(p + q) = N    Np     Npq(q + p) = Npq

   Mean                 80/100 = .8         Np/N = p
   Variance             16.0/100 = .16      Npq/N = pq
   Standard deviation   √.16 = .4           √pq
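
The argument that a proportion is a mean can also be verified by brute force. The following Python sketch scores the 100 hypothetical examinees of the example as 1 or 0 and computes the mean and standard deviation in the ordinary way; the variable names are illustrative only.

```python
import math

# 80 correct answers scored 1 and 20 incorrect answers scored 0.
scores = [1] * 80 + [0] * 20
n = len(scores)

mean = sum(scores) / n                                # the proportion p
variance = sum((x - mean) ** 2 for x in scores) / n   # should equal pq
sd = math.sqrt(variance)                              # should equal sqrt(pq)

p, q = mean, 1 - mean
se_as_mean = sd / math.sqrt(n)            # SE of a mean, with sqrt(N) below
se_as_proportion = math.sqrt(p * q / n)   # formula (9.9b)

print(mean, round(variance, 2), round(sd, 2))               # 0.8 0.16 0.4
print(round(se_as_mean, 2), round(se_as_proportion, 2))     # 0.04 0.04
```

Both routes give the same standard error, which is the point of the demonstration.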


The Standard Error of a Percentage. If we wish to work in terms of percentages
instead of proportions we may do so. Let the percentage be denoted 
by P and let Q equal 100 — P. Remembering that a percentage is 100 times 
its corresponding proportion, the standard error of a percentage will be 100 
times as large as that for the proportion. The formula reads 


$$\sigma_P = 100\sqrt{\frac{pq}{N}} = \sqrt{\frac{PQ}{N}} \quad \text{(Standard error of a percentage)} \tag{9.10}$$

The Standard Error of a Frequency. A frequency, or the number of cases 
in a certain category, is equal to N times p, the proportion; consequently the 
standard error of a frequency is N times that for a proportion, and we have 
the formula 

$$\sigma_f = N\sqrt{\frac{pq}{N}} = \sqrt{Npq} \quad \text{(Standard error of a frequency)} \tag{9.11}$$


Out of 30 students who attempted a certain test item, 18 succeeded and 12 
failed. How much confidence can we have that the 18 successes represent 
the actual success rate for the larger population these 30 students represent? 
The standard error, assuming a population $\bar{p}$ equal to .60, by formula (9.11) 
is equal to $\sqrt{30 \times .6 \times .4} = \sqrt{7.20} = 2.7$. This obtained frequency may 
therefore be presumed not to deviate more than 2.7 from the average frequency
to be expected if we had examined the entire population in samples 
of 30, with a degree of confidence that can be expressed as a 2 to 1 bet. With 
a degree of confidence expressed by a 19 to 1 bet, we could say that we do not 
expect that this obtained frequency departs by more than 5.4 from the average
frequency we would get from many such samples. 
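
Formulas (9.10) and (9.11) are simple rescalings of the SE of a proportion, as a brief Python sketch shows. The function names are illustrative; the second call reuses the newspaper figures from earlier in the chapter purely as a check on the scaling.

```python
import math


def se_frequency(p, n):
    # Formula (9.11): N times the SE of a proportion, i.e. sqrt(Npq).
    return math.sqrt(n * p * (1 - p))


def se_percentage(big_p, n):
    # Formula (9.10): P is a percentage and Q = 100 - P.
    return math.sqrt(big_p * (100 - big_p) / n)


se_f = se_frequency(0.60, 30)
print(round(se_f, 1))                    # 2.7
print(round(2 * se_f, 1))                # 5.4, the "19 to 1" margin
print(round(se_percentage(65, 100), 1))  # 4.8 percentage points
```

The last figure is 100 times the .048 obtained for the same data expressed as a proportion.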

Reliability of a Coefficient of Correlation. Like every statistic, the coefficient
of correlation is subject to errors of sampling. Let us say that in a certain
population the parameter correlation $\bar{r}$ is equal to .30. From this population
we take successive samples of 50 pairs of observations each. The sample
r's will fluctuate in a sampling distribution around the population value. 
An example of this has already been reported in Table 9.2, where $\bar{r}$ was .53. 
How much variability may we expect? We need a standard error of r and 
some knowledge of the form of sampling distribution in order to say. 

Sampling Distribution of r. The sampling distribution of r is not of uni- 
form shape. It depends upon the size of r and the size of sample. It is 
already known to the reader that the limits of r are —1.0 and +1.0. An 
obtained r cannot exceed those values. Consequently, as the population $\bar{r}$ 
approaches those limits, the sampling distribution becomes more and more 
skewed, negatively skewed for positive r's and positively skewed for negative 
r's. Only when the population $\bar{r}$ is approximately zero is the sampling distribution
expected to be symmetrical (see Fig. 9.5). 

For very large samples, however, one need not worry very much about 
skewness in practice when $\bar{r}$ is within the limits of −.80 and +.80. The 
larger the sample, the narrower the dispersion of r’s, and consequently the 
less restricting effect provided by the limits of —1.0 and +1.0. 

When $\bar{r}$ is zero but the sample is small (under 30), although the sampling 
distribution is symmetrical it is not quite normal, for reasons which will be 
left to the discussion of small-sample statistics in Chap. 10. 

An Estimate of $\sigma_r$. We estimate the standard error of r by the general 
formula 

$$\sigma_r = \frac{1 - r^2}{\sqrt{N - 1}} \quad \text{(SE of a Pearson product-moment coefficient of correlation)} \tag{9.12}$$


This formula is only a close approximation. It would, of course, be more 
accurate if we wrote $\bar{r}$ instead of r. There is little risk in using r as an estimate
of the population parameter if samples are large and if r is large. 

Examination of the formula will show that, for the same size of sample, $\sigma_r$ 
is largest when r = .00 and becomes smaller as r approaches −1.0 or +1.0. 
The size of the standard error itself indicates how much risk we take in letting 
r stand for $\bar{r}$. 

To illustrate the use of formula (9.12) with the case in which we know the 
population $\bar{r}$ first, let us take the values mentioned above, with $\bar{r} = .30$ and 
N = 50. We have 

$$\sigma_r = \frac{1 - .09}{\sqrt{49}} = .13$$

Interpreted, this means that, with a population $\bar{r}$ equal to .30, we may expect 
two-thirds of sample r's, when N = 50, to lie within .13 of the parameter $\bar{r}$, in 
other words, between .17 and .43. We also might expect 95 per cent of the 
sample r’s under these conditions to be between .04 and .56, these values 
being $2\sigma_r$ distances from .30. There would be only 1 chance in 100 that sample
r’s could deviate as much as .335 (this being equal to $2.58\sigma_r$) from the 
population value. This much deviation marks off the range from −.035 to 
.635. 
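
The three ranges just quoted follow mechanically from formula (9.12). A minimal Python sketch, with illustrative function names and the multipliers 1, 2, and 2.58 used in the text, is:

```python
import math


def se_r(r, n):
    # Formula (9.12); the text calls it only a close approximation.
    return (1 - r ** 2) / math.sqrt(n - 1)


def limits(center, se, multiplier):
    # Symmetric limits at a given number of standard errors from the center.
    return round(center - multiplier * se, 3), round(center + multiplier * se, 3)


r_pop, n = 0.30, 50
se = round(se_r(r_pop, n), 2)
print(se)                        # 0.13
print(limits(r_pop, se, 1))      # (0.17, 0.43)
print(limits(r_pop, se, 2))      # (0.04, 0.56)
print(limits(r_pop, se, 2.58))   # (-0.035, 0.635)
```

The outermost limits rest on the assumption of a normal sampling distribution, which is least trustworthy in the tails.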

We should not be too sure of these interpretations involving the extreme 
tails of the distribution, since departures of the sampling distribution from 
normal form would show up most at those places. But it can be seen how 
even negative coefficients might arise by random sampling occasionally, even 
when the population correlation is as large as .30. The smaller the r and the 
smaller the sample, the more likely are these reversals of algebraic sign of 
correlation to occur. 

Consider next the case when we must substitute an obtained r for the 
parameter $\bar{r}$ in the use of formula (9.12). Let us use the obtained correlation 
of +.61 from the problem in Table 8.5. 

$$\sigma_r = \frac{1 - .61^2}{\sqrt{N - 1}} = .068$$

It is sufficient to report $\sigma_r$, as for most standard errors, to two significant 
digits. From the result we may say that whatever the population $\bar{r}$ may be 
(and it is probably not far from .61), an obtained r such as .61 would not 
deviate from it by more than .068 with a confidence indicated by odds of 2 to 
1. There are less than 5 chances in 100 that in samples of this size the sample 
r would depart more than .136 from the population value, and less than 1 
chance in 100 that the sample r would depart more than .175, above or below 
it. The obtained r, consequently, seems securely placed in a region that is 
removed from zero or negative correlations. 

The Significance of Small r's. When r's are small, i.e., in the region of zero 
but either positive or negative, our interest should usually center on the question
as to whether such values could have arisen when the population correlation
is actually zero. In the previous illustrations we were more concerned 
with the accuracy of determination of the amount of correlation. Incidental 
to that problem we saw that some sampling distributions could come close to 
zero if not extend beyond it. This becomes a very serious problem when 
coefficients are numerically small and samples are not large enough to fix the 
boundaries of sampling fluctuation definitely clear of zero. 

"The best approach to the small r is to assume that the population correla- 
tion is actually zero and then ask whether, with the size of sample being what 
it is, the obtained r could have occurred merely by random sampling. Our 
being able to conclude whether the obtained r represents any genuine correla- 
tion at all depends upon this kind of test. Incidentally, assuming that the 
population $\bar{r}$ is zero is one form, or one application, of the null hypothesis of 
which we shall hear much more later on. Our working hypothesis is that 
there is a nil amount of correlation. 

Since formula (9.12) implies the use of the population $\bar{r}$, we may insert any 
value for it that we please (except ±1.00, which would shrink $\sigma_r$ to zero). 
Any $\bar{r}$ we chose to insert would be our hypothesis about the amount of correlation.
We could then compute $\sigma_r$ and test the hypothesis by seeing whether 
the obtained r deviates too far from $\bar{r}$ to be reasonable. A deviation that 
goes outside the practical limits of the normal distribution would, of course, 
be very unreasonable. A deviation that is so large as to occur by chance only 
a very small proportion of the time would also be seriously questioned. 

When the population $\bar{r}$ is zero, the standard error is estimated by the 
formula 

$$\sigma_r = \frac{1}{\sqrt{N - 1}} \tag{9.13}$$

This formula will apply satisfactorily when N is not less than 30. Applying 
this formula to the data of Table 8.5, we have 

$$\sigma_r = \frac{1}{\sqrt{N - 1}} \approx .11$$

The obtained correlation, .61, is more than five times as large as this standard 
error. So very rarely could this much correlation occur by random sampling 
in a population where X and Y are actually uncorrelated that we can reject 
the null hypothesis and say that almost certainly there is some correlation. 
We would not ordinarily make this test of a coefficient as large as .61 unless 
the sample were quite small. Even if the sample were 26, in which case 
$\sigma_r = .20$, this obtained correlation would be at least three times the standard 
error. 
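
As a rough Python sketch of this test, the obtained r can be divided by the standard error that holds under the null hypothesis. The function names are illustrative, and the sample size of 26 is the text's own hypothetical case, since the N of Table 8.5 is not restated in this section.

```python
import math


def se_r_null(n):
    # Formula (9.13): SE of r when the population correlation is zero.
    return 1 / math.sqrt(n - 1)


def ratio_to_null_se(r, n):
    # How many null-hypothesis standard errors the obtained r lies from zero.
    return r / se_r_null(n)


print(se_r_null(26))                          # 0.2
print(round(ratio_to_null_se(0.61, 26), 2))   # 3.05
```

A ratio above 2.58 corresponds to less than 1 chance in 100 under a normal sampling distribution, although the text cautions that the formula is satisfactory only when N is not less than 30.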

Minimum Significant r's. A more convenient and practical procedure for 
determining whether an obtained coefficient of correlation is significantly 
different from zero is provided by the Wallace-Snedecor tables (see Table D, 
Appendix B). This approach is based upon small-sample statistics, which 
are treated in the next chapter. 

In the first column of Table D are given the number of degrees of freedom 
available for the coefficient. In each correlation problem the df is N — 2. 
The number of observations is the number of pairs of X and Y values. One 
degree of freedom is lost in the use of the mean of each variable, $M_x$ and $M_y$, 
from which the deviations x and y of the correlation formula take their 
departure. 

Having located the proper number of df in Table D, we find in the second 
column two values. One is the minimum r that is significant at the .05 level, 
and the other, in bold-face type, is the minimum r significant at the .01 level. 
If we are satisfied with these criteria for rejection of the null hypothesis 
regarding correlation, this procedure will do very well. If we want greater 
refinement of information or the use of other standards of confidence, we 
would use formula (9.13), or, in the case of small samples, formula (10.3). 
The minimum r's in Table D were determined by use of formula (10.3). 

Examination of Table D shows that for samples with 1,000 df r must be at 
least .062 to be significant at the .05 level. An r of .062 or larger, positive or 
negative, could arise by chance when $\bar{r}$ is zero only 5 times in 100. If we 
reject the idea that the population $\bar{r}$ is zero, we have 5 chances in 100 of being 
wrong. For the same size of sample, an r of .081 is required for significance 
at the .01 level. Thus, if we obtained a correlation of .10 (either positive or 
negative) we could feel very confident that there is some relationship between 
X and Y and that it is in the direction indicated by the algebraic sign. We 
could apply the estimated $\sigma_r$ to mark off confidence limits about the obtained r. 

Thus, even very low coefficients, such as .10, may indicate a genuine rela- 
tionship, but it takes a very large sample to establish that conclusion and to 
determine its probable value. On the other hand, some obtained r's of 
moderate size may be very uncertain indicators of any relationship at all, 
when samples are smaller. Note that when N is 10 (8 df) the minimum r's 
required for the .05 and .01 levels are .632 and .765, respectively. Even if 
our obtained r exceeded those values when N is 10, the exact amount of correlation
would be exceedingly uncertain. Correlations derived from small 
samples are good for little else than testing the null hypothesis, unless they 
happen to be .90 or above. When a very small r proves to be significant at 
the .01 level by virtue of a large sample, however, the fact that it is significant 
does not necessarily mean that the relationship is a very useful one. The 
connection between X and Y might be of no consequence unless the demon- 
stration of some connection settles an important scientific fact. 
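
The tabled minimum r's quoted above can be reproduced, at least approximately, from critical values of Student's t. The Python sketch below assumes that formula (10.3), which belongs to the next chapter and is not shown here, is the usual t test for a correlation, $t = r\sqrt{N-2}/\sqrt{1-r^2}$; solving for r gives $r = t/\sqrt{t^2 + df}$. The function name is illustrative, and the t values are taken from a standard two-tailed t table.

```python
import math


def min_significant_r(t_crit, df):
    # Smallest |r| whose t ratio reaches t_crit with df = N - 2
    # (assumed form of formula (10.3), solved for r).
    return t_crit / math.sqrt(t_crit ** 2 + df)


# N = 10, so df = 8; critical t at the .05 and .01 levels
print(round(min_significant_r(2.306, 8), 3))     # 0.632
print(round(min_significant_r(3.355, 8), 3))     # 0.765

# df = 1,000
print(round(min_significant_r(1.962, 1000), 3))  # 0.062
print(round(min_significant_r(2.581, 1000), 3))  # 0.081
```

These agree with the values the text cites from Table D, which lends some support to the assumed form of the formula.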

Fisher’s z Coefficient. Because of the numerous radical departures of the 
sampling distribution of r from normal form, and the limitations to our interpretations
that result from this, R. A. Fisher has developed another statistic 
into which an obtained r can be transformed by formula and which does have 
a normal sampling distribution, even when N is small. The statistic has been 
called $\mathbf{z}$, which we write in bold face to distinguish it from the standard measurement
z. They are definitely not the same statistic. Figure 9.5 shows 
sampling distributions of $\mathbf{z}$ as compared with those of corresponding r’s on 
their respective scales. 

[Figure 9.5. Horizontal axes: scale of r, from −1.00 to +1.00, and scale of $\mathbf{z}$, from −3.0 to +3.0.]

FIG. 9.5. Distributions of sample coefficients of correlation when N is very small and when 
the population correlations are .00 and .80. Corresponding to them are distributions of 
Fisher's $\mathbf{z}$ coefficients. Conversion of r to $\mathbf{z}$ brings about symmetrical sampling distributions,
regardless of the size of r. 

The range of $\mathbf{z}$ is from −∞ to +∞, but when r reaches the value .995, $\mathbf{z}$ is 
still short of the value 3.0. Up to an r of .25, $\mathbf{z}$ and r are approximately equal. 
Even when r = .50, $\mathbf{z}$ is no larger than .56. Within the limits from −.50 to 
+.50, then, distributions of r are essentially normal, if N is not too small. 
Beyond that range, when normal distributions are important, it would be 
well to convert r to $\mathbf{z}$. The transformation formula is 

$$\mathbf{z} = \tfrac{1}{2}\left[\log_e (1 + r) - \log_e (1 - r)\right] \tag{9.14}$$

where $\log_e$ stands for a logarithm to the base e or refers to the Napierian system
of logarithms.¹ In terms of logarithms in the common system (to the 
base 10), 

$$\mathbf{z} = 1.1513\left[\log_{10} (1 + r) - \log_{10} (1 - r)\right] \quad \text{[Same as (9.14) in terms of common logarithms]} \tag{9.15}$$


For general practice, Table H (Appendix B) may be used for the transformation
of r to $\mathbf{z}$ and also $\mathbf{z}$ to r. One would not report final results in terms 
of $\mathbf{z}$ but would convert back to the more familiar r value. For example, to 
find a confidence interval for an obtained r, if r were large it would be best to 
transform r to $\mathbf{z}$, determine the desired confidence limits for $\mathbf{z}$ (using the SE of 
$\mathbf{z}$, whose formula is about to be given), then find the r's corresponding to those 
$\mathbf{z}$ limits. In Chap. 13 we shall also see that $\mathbf{z}$ is brought into use in averaging 
coefficients of correlation. 

The standard error of $\mathbf{z}$, unlike that for r, is uniform for all values of $\mathbf{z}$ (with 
N constant). It can be estimated by the formula 

$$\sigma_{\mathbf{z}} = \frac{1}{\sqrt{N - 3}} \quad \text{(Standard error of } \mathbf{z}\text{)} \tag{9.16}$$

The SE of $\mathbf{z}$ can be interpreted and used like any statistic that has a normal 
distribution. It would be preferred, along with $\mathbf{z}$, in testing the significance 
of a difference between coefficients of correlation. 
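
The procedure for a confidence interval by way of Fisher's transformation can be sketched in Python. Formula (9.14) is the inverse hyperbolic tangent, so the standard library's `math.atanh` and `math.tanh` take the place of Table H here. The function names, and the example of r = .80 with N = 28, are hypothetical and chosen only for illustration.

```python
import math


def r_to_z(r):
    # Formula (9.14): 0.5 * [ln(1 + r) - ln(1 - r)], i.e. atanh(r).
    return math.atanh(r)


def z_to_r(z):
    # Inverse transformation, back to the familiar r scale.
    return math.tanh(z)


def se_z(n):
    # Formula (9.16): depends on N only, not on the size of the correlation.
    return 1 / math.sqrt(n - 3)


def confidence_interval_r(r, n, multiplier=1.96):
    # Set limits on the z scale, then convert both limits back to r.
    z = r_to_z(r)
    half_width = multiplier * se_z(n)
    return z_to_r(z - half_width), z_to_r(z + half_width)


print(round(r_to_z(0.25), 3), round(r_to_z(0.50), 3), round(r_to_z(0.995), 3))
lower, upper = confidence_interval_r(0.80, 28)   # hypothetical r and N
print(round(lower, 2), round(upper, 2))
```

The resulting limits are not symmetrical about r; the lower limit lies farther from .80 than the upper one, which reflects the negative skew of the sampling distribution of a large positive r.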


THE RELIABILITY OF DIFFERENCES 


Of much more practical value than the standard errors of means, propor- 
tions, and the like are the standard errors of differences between means and 
between proportions and the like. In experimental practice, we are per- 
petually comparing measured results under two conditions that we arbitrarily 
set up. We ask such questions as to whether the eye is more sensitive during 
stimulation of other sense organs or in the absence of such stimulation; 
whether boys or girls are more capable in a test of perceptual speed; whether 
one method of teaching subtraction is superior to another in terms of resulting 
efficiency. This calls for one set of measurements under the one condition 
and another set under the other condition and a comparison of means. The 
statistical question is, “How reliable is the difference between means?” 

The Standard Error of a Difference between Uncorrelated Means. Again 
reliability is indicated by a standard error. The amount of fluctuation in a 
difference between sample means is naturally related to the amount of 
fluctuation in the means themselves. The simplest relationship is given by 
the formula 

$$\sigma_{d_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2} \quad \text{(SE of a difference between uncorrelated means)} \tag{9.17}$$

where $\sigma_{M_1}$ = SE of the mean of the first distribution and $\sigma_{M_2}$ = SE of the 
mean of the second distribution. This relationship holds only when the two 
sets of measurements are independent, i.e., uncorrelated. When we are dealing 
with matched groups, for example, particularly when individuals are matched 
pair by pair, the formula will have to be modified. But more of that later. 

¹ For the benefit of the mathematically sophisticated student, $\mathbf{z}$ is the hyperbolic arc 
tangent of r, or $\mathbf{z} = \tanh^{-1} r$. 

Let us apply formula (9.17) to a typical problem. A group of 114 men and 
a group of 175 women were given the same word-building test in which the 
score is the number of words built out of six letters in 5 min. The results are 
given in summarized form in Table 9.4. The women's mean of 21.0 is 1.3 
points higher than that for the men. This mean difference is very small 
numerically, but in view of the relatively large number of cases in the two 
samples, we should expect the obtained means to be very close to the population
means, and perhaps therefore it indicates a real sex difference. The 
stability of each mean is indicated by its SE, which is .572 in the case of the 
men and .371 in the case of the women. 

TABLE 9.4. MEANS AND OTHER STATISTICS IN THE COMPARISON OF MEN AND 
WOMEN IN A WORD-BUILDING TEST 

Just as sample means are distributed normally about the population mean 
when N is large, the sample differences between means are also distributed 
normally. The central value about which the differences between means 
fluctuate is also a population value. We do not know what that population 
value is. We are most concerned, first, in determining whether there is any 
difference at all, and second, in determining its approximate size. 

The statistical tests connected with differences, in principle, are very much 
like those we encountered in connection with correlation coefficients. Since 
most differences are small, we first make a test to see whether we are justified 
in rejecting the null hypothesis. The null hypothesis in this case is the sup- 
position that in the population there is no real difference. Stated in another, 
and more acceptable, way, the null hypothesis is that the two sample means 
arose by random sampling from the same population, same, that is, with 
respect to the variable measured; the two groups from which the two samples 
were drawn are obviously different in other respects, otherwise we should not 
have raised any question of a difference at all. 

In accordance with the null hypothesis, then, we assume a sampling distribution
of differences, with the mean at zero, or at $\bar{M}_1 - \bar{M}_2 = 0.0$. The 
deviation of each sample difference, $M_1 - M_2$, from this central reference 
point is equal to $(M_1 - M_2) - (\bar{M}_1 - \bar{M}_2)$, or $M_1 - M_2 - 0$. The deviation 
of each difference given in terms of standard measure would be the deviation 
divided by the standard error, which gives us a z value. In terms of a formula, 

$$z = \frac{M_1 - M_2}{\sigma_{d_M}} \quad \text{(A } z \text{ ratio for a difference between means)} \tag{9.18}$$

The numerator, to be quite complete, should read $M_1 - M_2 - 0$, as was 
stated above. But since the zero has no contribution to make to the computation,
it is dropped in ordinary practice. It will help the investigator using 
this formula to think more clearly if he remembers that, logically, zero belongs 
there. 


[Figure 9.6. Horizontal axis: scale of z ratios, from −3.0 to +3.0; $M_1 - M_2$ is negative to the left of 0 and positive to the right. Labeled areas: lower extreme .025 of the expected z's; upper extreme .005 of the expected z's.]

FIG. 9.6. A sampling distribution of z with a mean of 0, which corresponds to a hypothetical 
difference between means equal to zero. Shaded areas show the regions of extreme z’s, at 
the left those significant at the .05 level, and at the right those significant at the .01 level. 
Obtained z's (either positive or negative) in those extreme regions are interpreted accordingly. 


Figure 9.6 shows a sampling distribution of z ratios.¹ This distribution is 
real, though rarely derived by using actual data, since every difference we 
obtain by random sampling, with N's constant, provides its own z value. 
We could actually take a series of 100 or more paired samples, compute 
$M_1 - M_2$ for each pair, $\sigma_{d_M}$ for each pair, and consequently a z. The frequency
distribution of the 100 or more z’s we could set up from those data 
would approach the distribution in Fig. 9.6. 
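
The thought experiment just described is easy to carry out by simulation. In the Python sketch below, both samples of each pair are drawn from one normal population, so the null hypothesis is true by construction. The population mean of 20 and standard deviation of 5 are arbitrary assumptions, the sample sizes are borrowed from the word-building example, and 1,000 pairs are used instead of 100 for a steadier picture.

```python
import math
import random
import statistics


def se_mean(sample):
    # Assumed form of formula (9.2): sigma / sqrt(N - 1), with sigma computed
    # by the descriptive (divide-by-N) formula.
    return statistics.pstdev(sample) / math.sqrt(len(sample) - 1)


def z_ratio(sample1, sample2):
    # Formulas (9.17) and (9.18) applied to one pair of independent samples.
    se_diff = math.sqrt(se_mean(sample1) ** 2 + se_mean(sample2) ** 2)
    return (statistics.fmean(sample1) - statistics.fmean(sample2)) / se_diff


rng = random.Random(0)   # fixed seed so that the run is repeatable
zs = []
for _ in range(1000):
    first = [rng.gauss(20, 5) for _ in range(114)]
    second = [rng.gauss(20, 5) for _ in range(175)]
    zs.append(z_ratio(first, second))

share_beyond = sum(abs(z) >= 1.96 for z in zs) / len(zs)
print(round(statistics.fmean(zs), 2), round(statistics.pstdev(zs), 2), share_beyond)
```

The simulated z's should center near zero with a standard deviation near 1, and roughly 5 per cent of them should fall beyond ±1.96, as Fig. 9.6 implies.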

¹ In some textbooks and in reports of research, this ratio of a deviation to a standard error 
is called a “critical ratio,” symbolized by CR. 

Testing the Null Hypothesis. For the word-building test we have the information
(see Table 9.4) that the difference in the obtained sample is −1.3. 
The algebraic sign of the difference does not concern us at this time; we are 
interested in its amount. The standard error $\sigma_{d_M} = .682$. From this, 

$$z = \frac{M_1 - M_2}{\sigma_{d_M}} = \frac{1.3}{.682} = 1.91$$

The value 1.91 tells us how many $\sigma_{d_M}$'s the obtained difference extends from
the mean of the distribution. The mean, under the null hypothesis that is
being tested, is a difference of zero. Since the sample is large, we may assume
a normal distribution of the z's. The obtained z fails by just a little to reach
the .05 level of significance (which for large samples is 1.96); consequently we
would not reject the null hypothesis and we would say that the obtained
difference is not significant. There may actually be some difference, but we
have not enough assurance of it. There are more than 5 chances in 100 that
a difference as large as this one, or larger, could have happened by random
sampling from the same population (the same, that is, with respect to
word-building ability). A more practical conclusion would be that we have
insufficient evidence of any sex difference in word-building ability, at least in
the kind of population sampled. Note that the conclusion was not stated to the
effect that we have demonstrated that there is no sex difference in word-building
ability. We cannot prove the truth of the null hypothesis; we can only demon-
strate its improbability.
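The test just described can be sketched in a few lines of Python. This is only an illustration of the arithmetic, using the difference (1.3) and standard error (.682) quoted above; the function name `two_tailed_p` is a hypothetical helper, and the normal-curve probability is obtained from the standard library rather than from a printed table.

```python
import math

def two_tailed_p(z: float) -> float:
    """Probability, under a unit normal curve, of a deviate at least as
    large as |z| in either direction."""
    return math.erfc(abs(z) / math.sqrt(2.0))

difference = 1.3        # |M1 - M2| from the text (algebraic sign ignored)
se_difference = 0.682   # standard error of the difference, from the text

z = difference / se_difference
p = two_tailed_p(z)

print(f"z = {z:.2f}, two-tailed p = {p:.3f}")
if abs(z) >= 1.96:
    print("Reject the null hypothesis at the .05 level")
else:
    print("Do not reject the null hypothesis at the .05 level")
```

The probability comes out a little above .05, which is the sense of the statement that there are more than 5 chances in 100 of a difference this large arising by random sampling.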

Had the z test turned out very significant, i.e., with less than 1 chance in
100 that by chance a z could be so large, we should then have been interested
in the size of the difference.¹ Our interest would then have reverted to the
standard error of the difference and the probable limits it suggested for the
size of the difference. This procedure is so similar to that for determining
the probable size of any population parameter that we need not go through
the steps here. A confidence interval and confidence limits could be set up by
the usual procedures. Actually, we may do this even if the difference proves
to be insignificant.

¹ There is some custom for referring to a difference or a deviation that is significant at or
beyond the .01 level as being “very significant”; one significant between the .05 and .01
levels as being “significant”; and one that fails to reach the .05 level as being “insignificant.”
The more usual practice, however, is to state the probabilities.

The Standard Error of a Difference in Correlated Data. When the data
are so sampled that there is a correlation between the means in the two vari-
ables measured, i.e., so that the means in pairs of samples tend to rise or fall
together (positive correlation) or tend to be contrasting so that when one rises
the other falls (a negative correlation), the SE of a difference is estimated by
the formula

$$\sigma_{d_M} = \sqrt{\sigma_{M_1}^2 + \sigma_{M_2}^2 - 2 r_{12}\,\sigma_{M_1}\sigma_{M_2}}$$
(SE of a difference between correlated means) (9.19)

which is like formula (9.17) except for the last term, in which $r_{12}$ is the corre-
lation between the two sets of means.

Fortunately, under the usual circumstances of random sampling, the corre-
lation between the two sets of means is approximately equal to the correlation
between two sets of single measurements in two samples. Since we ordinarily
have only two samples with two means from which we could not compute $r_{12}$
between the means, this fact is a great convenience. But in order to compute
the correlation between single measurements, we must have the individual
measurements in the two samples paired off two by two in some manner.
For example, if the same group of students takes the same word-building test
twice instead of two different groups taking it, we have the same individual's
score in the first trial to pair off with his score in the second trial. Or if, in
comparing males and females in the test, we want to standardize our two
groups better by taking a brother and a sister from each family or if we pair
boy with girl with respect to age, IQ, or social status, or all such factors, then
if these factors of common family, common age, IQ, or social status have any
relation to word-building score, they automatically introduce correlation into
the two samples. We compute a coefficient of correlation in the manner
described in Chap. 8 and introduce it into formula (9.19).

In Table 9.5, we find two sets of knee-jerk measurements, both from the 
same 26 men but under two conditions. In the first case (T), the subjects 
were squeezing a hand dynamometer just before the stimulus struck the knee, 
and in the second case (R) the “relaxed” knee jerk was obtained under a 
relaxed, sitting posture. Will the average man show a real difference in 
height of knee jerk under the tensed condition, as theory would lead us to 
expect? The two means, with a difference of 3.39 deg., suggest that the 
theory is vindicated. But we want to be sure that this large a difference 
could not have happened by random sampling from a population of measure- 
ments in which the actual difference is zero. 

If we were to assume no correlation between the tensed and normal meas-
urements of knee jerk, we should apply formula (9.17), or we should apply
formula (9.19) with an $r_{12}$ equal to zero, which is actually the same thing.
Such a $\sigma_{d_M}$ turns out to be 2.37 deg. of arc. The z ratio is 3.39/2.37, or 1.43.
This z falls decidedly short of the .05 level of significance. We should con-
clude, erroneously, that although there is some difference in the expected
direction, it is not a significant one. So far as these indications go, we should
not be called upon to reject the null hypothesis; the difference of 3.39 could
represent merely a result of random sampling.

When we compute a coefficient of correlation between the two sets of meas- 
urements, we find it to be +.82. This means that the men came rather 
closely in the same rank order in both the tensed and the relaxed conditions. 
If a man has a high kick under normal conditions, he will be likely to have a 
correspondingly high kick during the tensed conditions. If a man is low in 
the one case, he is likely to be low in the other. If the sampling is random,
there would be a similar correlation between means under the two conditions. 
If another group of 26 men had a higher normal average response than this 
one, it would be likely also to have a higher average tensed response. 

When means rise and fall together, they tend to maintain the same differ-
ence between them. In the case of a perfect positive correlation (r = +1.0),
the difference between means would remain exactly constant. If all the
sample differences between means were identical, their dispersion would be
zero and $\sigma_{d_M}$ would equal zero. We should then be almost certain of a differ-
ence in the obtained direction. A correlation of +.82 is less than 1.00, how-
ever, and so there is still some room for variability among the differences.
But from the line of reasoning just completed, we can see that the $\sigma_{d_M}$ is going
to be smaller than it turned out to be when we assumed an r equal to zero.

TABLE 9.5. STRENGTH OF THE PATELLAR REFLEX UNDER TWO CONDITIONS, TENSED
AND RELAXED, FOR 26 MEN, AND DIFFERENCES BETWEEN THEM
(Measurements Are in Terms of Degrees of Arc)

             T          R         T - R
           Tensed    Relaxed    Difference
             31         35         - 4
             19         14         + 5
             22         19         + 3
             26         29         - 3
             36         34         + 2
             30         26         + 4
             29         19         +10
             36         37         - 1
             33         27         + 6
             34         24         +10
             19         14         + 5
             19         19           0
             26         30         - 4
             15          7         + 8
             18         13         + 5
             30         20         +10
             18          1         +17
             30         29         + 1
             26         18         + 8
             28         21         + 7
             22         29         - 7
              8          4         + 4
             16         11         + 5
             21         23         - 2
             35         31         + 4
             26         31         - 5
   Σ        653        565         +88
   M         25.12      21.73       3.39
   σ          7.17       9.45       5.50
   σ_M        1.43       1.89       1.10


By the use of the complete formula (9.19) we find the $\sigma_{d_M}$ to be 1.10,
which is less than half the previous estimate of 2.37. The z ratio is now
3.39/1.10 = 3.08. A z above 3 is obviously in the “very significant”
category.¹ We therefore feel very confident that there is a genuine difference
in favor of the tensed conditions. This is not saying that we feel confident
that the actual difference is exactly 3.39; it might be more or less than that.
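The effect of the correlation term can be checked with a short Python sketch of formula (9.19). It uses only the rounded summary values quoted in the text (standard errors of 1.43 and 1.89, a mean difference of 3.39, and r = +.82); the function name `se_difference` is a hypothetical helper, not part of any library.

```python
import math

def se_difference(se_1: float, se_2: float, r: float = 0.0) -> float:
    """Standard error of a difference between two means, formula (9.19).
    With r = 0 this reduces to the uncorrelated case, formula (9.17)."""
    return math.sqrt(se_1 ** 2 + se_2 ** 2 - 2.0 * r * se_1 * se_2)

se_tensed, se_relaxed = 1.43, 1.89   # standard errors of the two means
mean_difference = 3.39               # tensed minus relaxed, in degrees of arc

se_uncorrelated = se_difference(se_tensed, se_relaxed)          # assumes r = 0
se_correlated = se_difference(se_tensed, se_relaxed, r=0.82)    # uses r = +.82

z_uncorrelated = mean_difference / se_uncorrelated
z_correlated = mean_difference / se_correlated

print(f"r = 0:   SE = {se_uncorrelated:.2f}, z = {z_uncorrelated:.2f}")
print(f"r = .82: SE = {se_correlated:.2f}, z = {z_correlated:.2f}")
```

Because the inputs are already rounded to two decimals, the correlated standard error comes out near 1.09 rather than exactly the 1.10 given in the text; the conclusion, a z ratio above 3, is unaffected.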

Since we might have expected the results to be in this direction, a one-tail
test could have been made. If the investigator were to predict a difference
in this direction in advance, he would make a one-tail test instead of a two-
tail test. He would test the hypothesis that the mean difference is zero or
negative. His significance level would be either .025 or .005 for a positive
deviation of 1.96σ or 2.58σ, respectively. The subject of one-tail tests will
be discussed more fully in Chap. 10.

Observations Should Often Be Paired. In setting up an experiment with 
two groups of subjects or two groups of measurements for statistical compari- 
son, it is well to pair off cases two by two if possible, so that a correlation can 
be computed. 

Often when such pairing is not actually carried out, there would be correla-
tion between the means of the samples anyway; the full formula for the SE
of a difference cannot then be applied, and the $\sigma_{d_M}$ by formula (9.17) is over-
estimated. It is true that under these circumstances, if the correlation is
positive, we can say that the correct $\sigma_{d_M}$ is smaller and that the correct z
ratio is larger than the one we estimated. When we have a significant or very
significant z under the circumstances, we can be sure that the z we would
obtain by taking into account the positive correlation would be even larger.

One difficulty is that when the z obtained under these circumstances is too
small to be significant we cannot conclude anything in particular. Least of
all can we conclude that the actual difference is probably zero. For had we
considered the correlation, we might have found a significantly large z. The
process of matching and the inclusion of the correlation factor in the $\sigma_{d_M}$
formula are said to increase the power of the test. By this is meant that the
test is more sensitive to a difference when it is genuine. As a result, we are
more likely to avoid the error of accepting the null hypothesis when it is
incorrect.

¹ A sample of 26 pairs of observations would generally be regarded as a small sample.
A small-sample t test would lead to the same conclusion in this instance, however.

In pairing off individuals or observations, it is important that the pairing
be done on some meaningful basis. It will not pay to do any pairing except
on the basis of some trait that correlates with the measurements on which the 
two groups are going to be compared. For example, if we were to compare 
two groups of boys as to ability to do a high jump, one group after training of 
a certain kind and the control group without such training, it would be impor- 
tant that the two groups be equated as to age, among other things. Ability 
in the high jump, regardless of training, would be dependent upon age, hence 
correlated with it, but the ability is probably not correlated significantly with 
a grade earned in arithmetic, and so there would be no point in matching the 
groups on this variable. 

Matching Groups. The basis upon which to match groups having been 
decided, there are two common ways of carrying out the matching. One is 
by pairing cases directly. In the problem just mentioned, for every boy of 
10 years 6 months in the one group, we would seek a boy of like age in the 
other. Small discrepancies may be permitted at times between pairs. If 
there are about twice as many cases in the one sample as in the other, match- 
ing two boys to one would be the solution. 

The other common way of matching groups is to ignore individuals as such 
and simply to attempt to make sure that the two samples have approximately 
equal means, standard deviations, and skewness on the matching variable. 

A SE of a Difference Obtained Directly from Differences. When individuals
have been paired off, we can find the desired statistics directly from differences
between pairs. In Table 9.5 we find the difference in knee-jerk measure-
ments (T - R), given with algebraic signs, for every individual. If we sum
them and divide by N, we obtain the mean of the differences, which is equal
to the difference between the means. If we calculate the SE of the mean of
these differences, we have $\sigma_{d_M}$. The $\sigma_{d_M}$ is thus obtained in the most direct
manner. We do not even need to know the SE's of the two means or the
amount of correlation present, yet our direct procedure has taken these
things into account.

The $\sigma_{d_M}$ for the knee-jerk data obtained in this manner is identical with that
which we found previously, as it should be. The interpretations and con-
clusions concerning the mean difference are the same as usual. This more
direct method is very strongly recommended whenever it can conveniently
be applied.
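The direct method can be sketched in Python from the 26 pairs of measurements in Table 9.5. The sketch assumes, as the table's summary rows imply, that σ is computed with N as divisor and that the standard error of a mean is σ divided by the square root of N - 1; the helper names `mean` and `sd` are illustrative only.

```python
import math

# Knee-jerk measurements from Table 9.5 (degrees of arc), same 26 men.
tensed = [31, 19, 22, 26, 36, 30, 29, 36, 33, 34, 19, 19, 26,
          15, 18, 30, 18, 30, 26, 28, 22, 8, 16, 21, 35, 26]
relaxed = [35, 14, 19, 29, 34, 26, 19, 37, 27, 24, 14, 19, 30,
           7, 13, 20, 1, 29, 18, 21, 29, 4, 11, 23, 31, 31]

def mean(values):
    return sum(values) / len(values)

def sd(values):
    """Standard deviation with N as divisor."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

# Work directly with the paired differences T - R.
differences = [t - r for t, r in zip(tensed, relaxed)]
n = len(differences)
mean_d = mean(differences)           # equals the difference between the means
sd_d = sd(differences)
se_d = sd_d / math.sqrt(n - 1)       # SE of the mean difference
z = mean_d / se_d

# The correlation is not needed for the direct method; it is computed here
# only to confirm that the differences have already absorbed it.
m_t, m_r = mean(tensed), mean(relaxed)
covariance = sum((t - m_t) * (r - m_r) for t, r in zip(tensed, relaxed)) / n
r_tr = covariance / (sd(tensed) * sd(relaxed))

print(f"mean difference = {mean_d:.2f}, sigma = {sd_d:.2f}, SE = {se_d:.2f}")
print(f"z = {z:.2f}, r between conditions = {r_tr:.2f}")
```

No correlation coefficient enters the computation of the standard error, yet the result agrees with the value obtained from formula (9.19).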

The Reliability of Differences between Proportions, Frequencies, and
Percentages. Consider the data in Table 9.6. Here we have the propor-
tions of 400 men and of 400 women students who judged two words as
“pleasant” or “very pleasant.” The two words were “to explore” and
“symphony.” Here we can raise two questions concerning each word. Is
there any sex difference in the proportion judging the word “pleasant”?
And within each sex, is there a significantly greater proportion of “pleasant”
judgments for one word than for the other? The differences themselves sug-
gest that the men favor the word “to explore” slightly more than do the
women, the difference in proportion being .0075. The women decidedly
more often favor the word “symphony,” with an excess of .2025 over the
proportion of the men who judge it pleasant. The men find the word “to
explore” more pleasing than they do the word “symphony” by a margin of
.1925, and the women, on the other hand, find the word “symphony” more
to their liking than “to explore” by a small margin of .0175. Which of these
differences, if any, are significant or very significant according to the rules
we have been following? We can test any or all of them for statistical
significance.

TABLE 9.6. PROPORTIONS OF 400 MEN AND 400 WOMEN WHO JUDGED THE WORDS
“TO EXPLORE” AND “SYMPHONY” PLEASANT; DIFFERENCES AND STANDARD ERRORS
OF DIFFERENCES; AND z RATIOS

             Difference          Correlation between
            between words       judgments of the words
   Men          .1925                   .342
   Women        .0175                   .395


The Standard Error of a Difference between Proportions. The standard error
of a difference between two proportions is given by the formula

$$\sigma_{d_p} = \sqrt{\sigma_{p_1}^2 + \sigma_{p_2}^2 - 2 r_{12}\,\sigma_{p_1}\sigma_{p_2}}$$
(SE of difference between proportions) (9.20)

where $\sigma_{p_1}$ = SE of the first proportion
      $\sigma_{p_2}$ = SE of the second proportion
      $r_{12}$ = correlation of proportions in pairs of samples¹

Again, it is fortunate for us that, when sampling is random, the correlation
between proportions is equal to the correlation between single cases. The
latter we can estimate from the data. In Table 9.6, we find that the correla-
tion between men's judgments of the two words is given as +.342 and the
correlation for the women is +.395, since both words were judged by the same
individuals. But in the comparison between sexes, there was no pairing of
individual judgments in any known way, and so we may assume that the
correlations are zero.

On this basis we find the $\sigma_{d_p}$ between men and women for the word “to
explore” to be .0235. The obtained difference of .0075 here yields a z ratio
of 0.32, which is decidedly not significant. The sex difference on the word
“symphony” gives a $\sigma_{d_p}$ of .0281, which yields a z ratio of 7.21. This is so far
above the .01 level that we are very confident that college
women (like those in the sample) find “symphony” more pleasant than do
college men (like those in the sample).

¹ This correlation should be derived from samples as a φ coefficient, or the correlation
of two genuinely dichotomous variables (see Chap. 13).

Men also decidedly prefer “to explore” to “symphony,” with the highly
significant z value of 8.23. Women, however, who find “symphony” more
pleasing than “to explore” by an excess of .0175, do not give any strong indi-
cation that the true difference is in this direction, for the z ratio is only 0.97.
The results are somewhat in line with what we should expect, but it can be
ventured that some differences that we expected to be true did not prove to be
significant and perhaps do not exist at all; for example, where we might have
expected a difference between sexes on “to explore,” a significant one failed to
appear.
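The four tests can be sketched in Python using formula (9.20). The four proportions below are assumptions, not values read from Table 9.6: they were chosen so that they reproduce the four differences quoted in the text (.0075, .2025, .1925, and .0175). The correlations of +.342 and +.395 and the sample size of 400 are those given in the text; the helper names are hypothetical.

```python
import math

N = 400  # number of men and number of women, from the text

def se_proportion(p: float, n: int = N) -> float:
    """Standard error of a single proportion, sqrt(pq / N)."""
    return math.sqrt(p * (1.0 - p) / n)

def se_difference(p_1: float, p_2: float, r: float = 0.0) -> float:
    """Formula (9.20): SE of a difference between two proportions."""
    s_1, s_2 = se_proportion(p_1), se_proportion(p_2)
    return math.sqrt(s_1 ** 2 + s_2 ** 2 - 2.0 * r * s_1 * s_2)

# ASSUMED proportions, chosen only to match the differences in the text.
men_explore, women_explore = 0.8775, 0.8700
men_symphony, women_symphony = 0.6850, 0.8875

# Between sexes: no pairing of individuals, so r is taken as zero.
z_sex_explore = (men_explore - women_explore) / se_difference(
    men_explore, women_explore)
z_sex_symphony = (women_symphony - men_symphony) / se_difference(
    women_symphony, men_symphony)

# Within each sex: the same persons judged both words, so r is used.
z_men = (men_explore - men_symphony) / se_difference(
    men_explore, men_symphony, r=0.342)
z_women = (women_symphony - women_explore) / se_difference(
    women_symphony, women_explore, r=0.395)

for label, z in [("sexes, 'to explore'", z_sex_explore),
                 ("sexes, 'symphony'", z_sex_symphony),
                 ("men, between words", z_men),
                 ("women, between words", z_women)]:
    print(f"{label:24s} z = {z:.2f}")
```

Under these assumed proportions the four ratios agree with those reported in the text to two decimals, which suggests the assumed values are at least consistent with the original table.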

Differences between Percentages and Frequencies. Similar tests of signifi- 
cance can be made for differences between percentages and frequencies. The 
uses of percentages and frequencies are here completely analogous to the use 
of proportions, as they have been in other connections. An illustration of 
how to test either of these differences will therefore not be given. 

The Reliability of Differences between Standard Deviations. If we are
concerned about differences in variability in two distributions as measured
by σ, we can make statistical tests of significance somewhat like the ones
already illustrated. The formula for the standard error of a difference
between σ's is

$$\sigma_{d_\sigma} = \sqrt{\sigma_{\sigma_1}^2 + \sigma_{\sigma_2}^2 - 2 r_{12}^2\,\sigma_{\sigma_1}\sigma_{\sigma_2}}$$
(SE of a difference between standard deviations) (9.21)

It is especially to be noted that the $r_{12}$ in this equation, unlike its appearance
in others, is squared, for it has been proved that the correlation between
standard deviations in pairs of samples is equal to the square of the correlation
coefficient between individual pairs of measurements.

We may apply this formula to the data in Table 9.4 for the word-building
test. Here we find the men more variable than the women by a difference of
6.08 - 4.89, or 1.19 points. Is this difference significant, or could it have
arisen as a natural deviation from an actual difference of zero, i.e., equality of
the sexes in variability? The $\sigma_{d_\sigma}$ proves to be .476 (the correlation being
zero) and the z ratio is 1.19/.476, or 2.50. The difference of 1.19 points there-
fore just fails to pass the hurdle of significance at the .01 level. There is just
more than 1 chance in 100 that, if the two sexes are equally variable in this
test, such a large discrepancy between their standard deviations could have
occurred by sampling. Just failing to “pass the hurdle,” however, should
not be stressed too much. The amount of difference obtained is a very rare
occurrence and strongly suggests the inference that there is a real sex differ-
ence in variability in the word-building test.
Under the heading of small-sample statistics will be found a radically differ-
ent method for testing a difference between two standard deviations. With
small samples the test given above breaks down completely for lack of normal
sampling distributions. A “small sample” in this connection is an N less
than 100.

Reliability of Differences between Coefficients of Correlation. If we have
two coefficients of correlation, $r_{12}$ and $r_{34}$, that have been obtained from inter-
correlating two pairs of variables and we want to test whether they could have
arisen from the “same population” by random sampling, then by analogy to
other formulas, the standard error of a difference between r's is estimated by

$$\sigma_{d_r} = \sqrt{\sigma_{r_{12}}^2 + \sigma_{r_{34}}^2 - 2 r_{r_{12} r_{34}}\,\sigma_{r_{12}}\sigma_{r_{34}}}$$
(SE of the difference between two coefficients of correlation with no
common variable) (9.22)

where $\sigma_{r_{12}}$ = standard error of $r_{12}$
      $\sigma_{r_{34}}$ = standard error of $r_{34}$
      $r_{r_{12} r_{34}}$ = correlation between sample values of $r_{12}$ and $r_{34}$

The estimation of the correlation of r's can be made by means of a very long
formula involving $r_{13}$, $r_{14}$, $r_{23}$, and $r_{24}$, as well as $r_{12}$ and $r_{34}$, which makes this
procedure forbidding. With no variable in common to the two r's being
compared, it is likely that the r between r's will be rather small. When one
of the variables in the $r_{12}$ correlation is very highly correlated with one in the
$r_{34}$ correlation, however, the correlation between r's would probably be of
sufficient size to call for its use.

The type of problem in which the average reader will be likely to test differ-
ences between r's is one in which one of the variables is common to the two
correlations. This calls for a different correlation of correlations (see formula
9.23). For this reason the reader is referred elsewhere¹ for the method of
estimating $r_{r_{12} r_{34}}$. Without using the correlation term one can sometimes
reject the null hypothesis with confidence, because z is underestimated, but
sometimes one could not feel very sure that he should not reject it if the
correlation between r's is of substantial size and is not used.

In experimental investigations in which we study the change in correlation
(perhaps reliability or validity) of a measuring instrument under different
conditions, one or both of the correlated variables is likely to enter into both
correlations. We determine the validity correlation for a test with and with-
out scoring weights using the same outside criterion. We compare the
validity coefficients of two similar verbal tests, also against the same criterion.
For such a situation we would be testing the difference between two correla-
tions $r_{12}$ and $r_{13}$, where variable $X_1$ is common to both. If we substitute $r_{13}$
for the correlation $r_{34}$ in formula (9.22), we can estimate the standard error
$\sigma_{d_r}$ for these two correlations. The correlation of the r's would be
$r_{r_{12} r_{13}}$. This correlation can be estimated by the formula

$$r_{r_{12} r_{13}} = \frac{2 r_{23}\left(1 - r_{12}^2 - r_{13}^2\right) - r_{12} r_{13}\left(1 - r_{12}^2 - r_{13}^2 - r_{23}^2\right)}{2\left(1 - r_{12}^2\right)\left(1 - r_{13}^2\right)}$$
(Correlation between two r's having one variable in common) (9.23)

¹ Peters and Van Voorhis, op. cit. P. 185.


The z Test of Differences between r's. Remembering that there are doubts
about the use of standard errors of r's when correlations are large and when
samples are not large, it would be well to consider testing differences between
z coefficients instead. Unfortunately, no one appears to have found a way
of estimating correlations between paired samples of z's. We must therefore
be limited to problems in which the correlation between z's is very small or
zero, as when the two correlations being compared arose from rather
independent variables.

With this limitation, the standard error of a z difference is

$$\sigma_{d_z} = \sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}$$
(SE of a difference between two z coefficients) (9.24)

Consider two r's, .82 and .92. The corresponding z coefficients
(from Table H) are 1.16 and 1.59, respectively. $N_1$ = 50 and $N_2$ = 60.
From these data,

$$\sigma_{d_z} = \sqrt{\frac{1}{47} + \frac{1}{57}} = .197$$

and

$$\frac{d_z}{\sigma_{d_z}} = \frac{1.59 - 1.16}{.197} = 2.18$$

From this result we should feel more confident than usual that the difference
is significant beyond the .05 level. For had we taken into account a possible
positive correlation between the z's, the ratio would have been larger, giving
us a more powerful test of the difference.
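The same computation can be sketched in Python. Instead of reading the z coefficients from a table, the sketch obtains them from the inverse hyperbolic tangent, which is the standard definition of Fisher's z transformation; the function names are hypothetical.

```python
import math

def fisher_z(r: float) -> float:
    """Fisher's z coefficient corresponding to a correlation r."""
    return math.atanh(r)

def se_z_difference(n_1: int, n_2: int) -> float:
    """Formula (9.24): SE of a difference between two independent z's."""
    return math.sqrt(1.0 / (n_1 - 3) + 1.0 / (n_2 - 3))

r_a, n_a = 0.82, 50
r_b, n_b = 0.92, 60

z_a, z_b = fisher_z(r_a), fisher_z(r_b)
se = se_z_difference(n_a, n_b)
ratio = (z_b - z_a) / se

print(f"z coefficients: {z_a:.2f} and {z_b:.2f}")
print(f"SE of difference = {se:.3f}, ratio = {ratio:.2f}")
```

With unrounded z coefficients the ratio comes out slightly higher than the figure obtained from the two-decimal table values, but it falls in the same region, beyond the .05 level and short of the .01 level.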


SOME SPECIAL PROBLEMS OF RELIABILITY AND SIGNIFICANCE


In this section we shall consider some modifications and applications of the
sampling statistics already explained. These have to do with unusual
sampling situations and a common experimental design in which changes are
compared.

Sampling in Stratified Population. Stratifying, in sampling, tends to
stabilize the dispersion of sample means and of other statistics, preventing
their scattering as much as would be true in a completely random sample.
Consequently, the $\sigma_M$ that would be derived in the usual manner would be an
overestimate. Such a standard error is a too-conservative index of the
fluctuations of the statistic.

Certain corrective procedures have been developed for the case of stratified- 
random sampling. The most general and serviceable formula for the SE of a 
mean is 
$$\sigma_M = \sqrt{\frac{\sigma^2 - \sigma_m^2}{N}}$$
(SE of a mean corrected for stratification) (9.25)


where $\sigma^2$ = variance in the total sample and $\sigma_m^2$ = variance among means of
subgroups. Each subgroup is a sample representing a stratum, within which
there has been random sampling. It should be pointed out that the variance
$\sigma_m^2$ is a weighted affair, i.e., the contribution of each set of data to the
variance is in proportion to its size. The formula for this is


$$\sigma_m^2 = \frac{N_1 (M_1 - M)^2 + N_2 (M_2 - M)^2 + \cdots + N_k (M_k - M)^2}{N}$$
(Weighted variance of means of sample sets) (9.26)


where $N_1, N_2, \ldots, N_k$ = numbers of cases in sets 1 to k, respectively
      $N = N_1 + N_2 + \cdots + N_k$
      $M$ = mean of composite sample

Similar formulas apply to the SE of a proportion.


$$\sigma_p = \sqrt{\frac{pq - \sigma_p^2}{N}}$$
(SE of a proportion corrected for stratification) (9.27)


where $p$ = proportion observed in the entire sample, all strata combined
      $q = 1 - p$
      $N$ = number in total sample
      $\sigma_p^2$ = variance of strata proportions about p
The solution for $\sigma_p^2$ needed in formula (9.27) is given by the formula

$$\sigma_p^2 = \frac{N_1 (p_1 - p)^2 + N_2 (p_2 - p)^2 + \cdots + N_k (p_k - p)^2}{N}$$
(Weighted variance of sets of sample proportions) (9.28)


where $p_1, p_2, \ldots, p_k$ = proportions observed in different sets
      $N_1, N_2, \ldots, N_k$ = corresponding numbers of cases in sets
      $N$ = total number of cases
      $p$ = proportion in composite of sets
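A minimal Python sketch of the correction for stratification follows. The three strata and their proportions are invented for illustration, and the sketch assumes that formula (9.27) has the form the square root of (pq minus the weighted variance of strata proportions) over N, which is how the definitions above read; the function name `weighted_variance` is hypothetical.

```python
import math

def weighted_variance(values, sizes, overall):
    """Weighted variance of stratum statistics about the overall value,
    each stratum contributing in proportion to its size (formula 9.28)."""
    n_total = sum(sizes)
    return sum(n_k * (v - overall) ** 2
               for v, n_k in zip(values, sizes)) / n_total

# HYPOTHETICAL strata: sizes and observed proportions.
sizes = [100, 200, 100]
proportions = [0.40, 0.50, 0.70]

n_total = sum(sizes)
p = sum(n_k * p_k for p_k, n_k in zip(proportions, sizes)) / n_total
q = 1.0 - p

var_strata = weighted_variance(proportions, sizes, p)

se_random = math.sqrt(p * q / n_total)                     # customary formula
se_stratified = math.sqrt((p * q - var_strata) / n_total)  # assumed form of (9.27)

print(f"p = {p:.3f}, variance of strata proportions = {var_strata:.6f}")
print(f"SE (random) = {se_random:.4f}, SE (stratified) = {se_stratified:.4f}")
```

The stratified standard error is the smaller of the two, and the gap widens as the strata proportions spread farther from the overall proportion.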

Sampling Statistics in Matched Samples. In some investigations, there is
restriction in sampling brought about by matching. Experimental and con-
trol groups are often equated in some respects while studying the effect of
some varied condition upon a measured outcome. Groups are frequently
“equated” for such matching variables as chronological age, mental age, IQ,
socioeconomic level, or for initial score on some particular task or test.

As in the case of stratified sampling, it pays to match samples only on
variables that are correlated with the measured variable, that is, the variable
on which we note the experimental outcome. The matching may be by pairs
(for example, for every individual of a certain kind in the experimental group
there is a similar one in the control group) or by total group (ensuring that
the means, standard deviations, and skewness are practically the same for
the matching variable in the two groups).

SE of a Mean in a Matched Sample. It is logical that if we try to keep suc-
cessive samples constant with respect to the mean on some variable positively
correlated with the experimental variable, the means on the latter will also
be kept more constant, depending upon the extent of the correlation. The
standard error of a mean should then be smaller under this restriction. The
general formula is

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \sqrt{1 - r_{mx}^2}$$
(SE of a mean corrected for effects of matching) (9.29)

where $r_{mx}$ is the correlation between the matching variable and the experi-
mental variable.

Inspection of formula (9.29) will show that the first factor, $\sigma / \sqrt{N - 1}$, is
the customary standard error. What the second factor, $\sqrt{1 - r_{mx}^2}$, does is
to modify downward the size of the standard error. The larger r becomes,
the greater the correction effect. The correlation has to be as high as .866 in
order to bring the correction factor down to .50, in which case the SE is half as
large as it would be without matching. The same change in $\sigma_M$ could be
accomplished by increasing the size of sample four times without matching.
When $r_{mx}$ is .707, the reduction is equivalent to that obtainable by doubling
the size of sample in random sampling. These two examples give some idea
of the economy of measurement to be achieved by matching samples.
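The two examples can be verified with a short Python sketch of the second factor in formula (9.29). The function names are hypothetical, and the "equivalent sample size" multiplier is an approximation that treats N - 1 and N as interchangeable.

```python
import math

def matching_factor(r_mx: float) -> float:
    """Second factor of formula (9.29): the multiplier applied to the
    customary standard error when samples are matched."""
    return math.sqrt(1.0 - r_mx ** 2)

def equivalent_sample_multiplier(r_mx: float) -> float:
    """How many times larger an unmatched sample would have to be to give
    the same standard error (approximate: treats N - 1 as N)."""
    return 1.0 / (1.0 - r_mx ** 2)

for r in (0.0, 0.50, 0.707, 0.866):
    print(f"r = {r:.3f}: factor = {matching_factor(r):.3f}, "
          f"equivalent sample multiplier = {equivalent_sample_multiplier(r):.2f}")
```

The rows for r = .707 and r = .866 reproduce the doubling and quadrupling described in the text, and the row for r = .50 shows how little is gained from a matching variable that correlates only moderately with the experimental variable.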

If the matching has been done on the basis of more than one variable, the 
correlation called for in formula (9.29) is the multiple correlation (see Chap. 
16) between a combination of the matching variables and the experimental 
variable. In the combination the components should be weighted according 
to a multiple-regression equation. If the weights depart from the optimal 
ones indicated by this procedure, the correlation-of-sums formula may be 
applied (see Chap. 16). Matching on the basis of many variables does not 
ordinarily pay unless the matching variables are themselves relatively inde-
pendent, i.e., uncorrelated with each other.

Sometimes a sample group is matched on the same variable, as when we
give a pretest and a posttest, with intervening experience or practice. In
this case, the paired cases are identical individuals. The variability of
means to be expected from successive sampling of this kind is indicated by
the following estimate of the SE.

$$\sigma_M = \frac{\sigma \sqrt{1 - r_{xx}}}{\sqrt{N - 1}}$$
(SE of a mean for matching on the experimental variable) (9.30)

where $r_{xx}$ is the test-retest reliability (see Chap. 17) of the experimental
variable. The reader who may be familiar with the reliability statistics
described in Chap. 17 will recognize the product $\sigma \sqrt{1 - r_{xx}}$ as the standard
error of measurement of individuals. Dividing this by the square root of the
degrees of freedom should indicate the dispersion of means of measurement.¹

The SE of a Proportion in a Matched Sample. The same principles just
discussed in connection with means also apply to proportions. When sam-
ples have been matched on the basis of some variable correlated with
the categorical variable on which the proportion is based, by analogy to
formula (9.29) we have

$$\sigma_p = \sqrt{\frac{pq}{N}\left(1 - r_{mx}^2\right)}$$
(SE of a proportion in matched samples) (9.31)

where $r_{mx}$ is the correlation between the matching variable and the experi-
mental variable. The coefficient should be a point-biserial r (see Chap. 13).
The matching variable could be a composite, as in the case of means. If the
matching variable is the experimental variable, the correlation term should
be the reliability coefficient, $r_{xx}$, by analogy to formula (9.30). SE's of per-
centages and of frequencies would be estimated by simple modifications of
formula (9.31), when samples are matched.

Sampling Statistics from Finite Populations. The discussions of sampling 
statistics thus far have pertained to the general case of infinite populations. 
At least, the populations have been assumed to be very large relative to the 
size of the samples. 

In some situations the population may be finite and not many times as
large as a sample. This restriction means that successive samples have a
much better chance of containing identical individuals. This leads to greater
similarity of means. If the size of the population is known, we can take it
into account in estimating the SE and hence obtain a more realistic figure for
it. A serviceable formula is

$$\sigma_M = \frac{\sigma}{\sqrt{N - 1}} \sqrt{1 - \frac{N}{N_p}}$$
(SE of a mean corrected for size of population) (9.32)

where $N_p$ is the number in the total population and other symbols are as
usually defined.

It can be seen that, as $N_p$ becomes very large compared with N, the correc-
tion term under the radical at the right approaches 1.0, and the SE is then
estimated by the customary formula. When the sample contains one one-
hundredth of the population, the value of the factor at the right reduces to
.995. The SE is then only 1/2 of 1 per cent lower than it would be without
the correction.
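The behavior of the correction term in formula (9.32) can be tabulated with a brief Python sketch. The function names and the sample values (σ = 10, N = 101) are hypothetical and serve only to show how the factor shrinks as the sample takes in more of the population.

```python
import math

def finite_population_factor(n: int, n_population: int) -> float:
    """Right-hand factor of formula (9.32), sqrt(1 - N / Np)."""
    if not 0 < n <= n_population:
        raise ValueError("sample size must be between 1 and the population size")
    return math.sqrt(1.0 - n / n_population)

def se_mean_finite(sd: float, n: int, n_population: int) -> float:
    """SE of a mean corrected for size of population, formula (9.32)."""
    return sd / math.sqrt(n - 1) * finite_population_factor(n, n_population)

# HYPOTHETICAL sample: sigma = 10, N = 101, drawn from populations of various sizes.
for n_population in (10_100, 1_010, 404, 202):
    factor = finite_population_factor(101, n_population)
    se = se_mean_finite(10.0, 101, n_population)
    print(f"Np = {n_population:6d}: factor = {factor:.3f}, SE = {se:.3f}")
```

When the sample is one one-hundredth of the population the factor is about .995, as stated above; when the sample is half the population the standard error drops to roughly 71 per cent of its customary value.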


¹ For further information on SE's in matched and other restricted samples, see Peters and
Van Voorhis, op. cit. Pp. 132-135.

A similar formula for the SE of a proportion obtained from a finite popula-
tion reads

$$\sigma_p = \sqrt{\frac{pq}{N}} \sqrt{1 - \frac{N}{N_p}}$$
(SE of a proportion corrected for size of population) (9.33)

where the symbols are defined as previously.

The last two formulas are useful in the case in which we want to know
whether the sample we have obtained is representative of the population
with respect to some statistic and its corresponding parameter. For exam-
ple, we cannot sample all the students who have taken certain mathematics
courses at a certain university in a certain year. We select those whom we
can get, which is to say that we have an “incidental” sample, mentioned in
the early part of this chapter.

One thing we can do is to ask whether the sample is like the total popula-
tion with respect to variables that are probably correlated with the experi-
mental variable, for example, age, scholastic-aptitude score, grades in mathe-
matics, etc. Whether they are correlated we can determine from the sample
that we have. Such correlations are not as likely to be affected by biased
sampling as are means, variances, and skewness of distributions.

In testing the significance of a difference between the population parameter 
and the corresponding sample statistic, we would take the former as a fixed 
value and determine the significance of the departure of the statistic from it. 
The $\bar{z}$ ratio would be formed from the difference divided by the SE of the
sample statistic as computed by the formulas just given. Although the 
population is finite and even not very large, we actually know its parameters. 
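
A minimal sketch of such a test for a mean follows. It assumes the corrected SE takes the form (SD / sqrt(N)) * sqrt(1 - N/N_p); all figures are hypothetical and the function name is ours.

```python
import math

def z_against_population(sample_mean, pop_mean, sd, n_sample, n_population):
    """Ratio of a sample mean's departure from a known population mean
    to an SE corrected for the finite size of the population (assumed form)."""
    se = sd / math.sqrt(n_sample) * math.sqrt(1.0 - n_sample / n_population)
    return (sample_mean - pop_mean) / se

# Hypothetical figures: 100 students obtained from a population of 400.
z = z_against_population(sample_mean=52.0, pop_mean=50.0, sd=10.0,
                         n_sample=100, n_population=400)
print(round(z, 2))
```

The ratio is then referred to the usual .05 and .01 standards to decide whether the incidental sample departs significantly from the population on that variable.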

The Significance of Differences between Changes. In experimental work 
we very frequently have a design involving the comparison between an experi- 
mental and a control group. The two groups are probably selected to begin 
with by matching them, either person to person or group to group, with 
respect to some quality or qualities. The experimental group is given treat- 
ment A; the control group is not. There is a final test, by which the members 
of both groups are measured. 

Let us suppose that the final test is identical with the initial test on which 
matching was effected. The experimenter's chief interest is therefore prob- 
ably centered on the amount of change in the experimental group as com- 
pared with that in the control group. How can he best reach a decision about 
this comparison of changes? 

In the experimental results there are essentially four means, and among
these four means there are, altogether, six differences. The two means from
the initial tests (which we may call $M_{e1}$ and $M_{c1}$, for experimental and control
groups, respectively) may be compared to determine whether matching has
been successful. A test of statistical significance of a difference between
these means would be justified only if no matching operations had been
applied, that is, only if the two groups were chosen at random from a pool. Formula
(9.17) would apply.

We also have two means from the final tests, $M_{e2}$ and $M_{c2}$, for the experimental
and control groups, respectively. The comparison of these two
means, if that is the crucial test adopted by the experimenter, would be made
using formulas (9.17) and (9.18), if sampling was not matched. If matching
has been done person to person, the test of the significance of this difference
should, of course, take into account the correlation term in formula (9.19).

If the matching has been done in terms of means and other statistics, the
following formula will apply:


$$\sigma_{d_M} = \sqrt{(\sigma_{M_1}^2 + \sigma_{M_2}^2)(1 - r_{mx}^2)} \qquad \text{(SE of a difference between means for matched groups)} \tag{9.34}$$


where $r_{mx}$ is the correlation between the matching variable and the experimental
variable. If the two variables are one and the same, as in the illustration
above, substitute the reliability (test-retest) coefficient $r_{tt}$ (but do not
square it) for $r_{mx}^2$. Note that the SE's of the two means used here should not
have been computed by formula (9.30), since the latter involves the correction
for matching. To use such SE's in formula (9.34) would effect a double
correction.
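
The computation can be sketched as follows, assuming formula (9.34) has the form sqrt((SE1^2 + SE2^2)(1 - r^2)), with the unsquared reliability coefficient taking the place of r^2 when the matching and experimental variables are the same test. The SE's and correlations are hypothetical, and the function names are ours.

```python
import math

def se_diff_matched_groups(se_m1, se_m2, r):
    """SE of a difference between means for groups matched on means
    (assumed form of formula 9.34); r is squared."""
    return math.sqrt((se_m1 ** 2 + se_m2 ** 2) * (1.0 - r ** 2))

def se_diff_same_test(se_m1, se_m2, r_tt):
    """Matching variable identical with the experimental variable:
    the reliability coefficient replaces r squared and is NOT squared."""
    return math.sqrt((se_m1 ** 2 + se_m2 ** 2) * (1.0 - r_tt))

# Hypothetical SE's of the two final means (uncorrected for matching).
print(round(se_diff_matched_groups(0.60, 0.80, 0.60), 3))   # 0.8
print(round(se_diff_same_test(0.60, 0.80, 0.75), 3))        # 0.5
```

The SE's supplied to either function should be the ordinary, uncorrected ones, in keeping with the warning against a double correction.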

Comparison of the means $M_{e2}$ and $M_{c2}$ is not the best way to reach a conclusion.
It will give us a statistical inference regarding those two outcomes
but not necessarily an answer to the question for which the experiment was
designed. Suppose that the experimenter could reject the null hypothesis.
Perhaps there was also a corresponding real difference latent in the original
test. Perhaps sampling errors did not permit this difference to show up in
the difference between means $M_{e1}$ and $M_{c1}$. Remember that we cannot
prove the truth of a null hypothesis.

Another approach that the experimenter might take is to compare first and
second means in each group. He might test the significance of the differences
$M_{e2} - M_{e1}$ and $M_{c2} - M_{c1}$. If the former is significant but the latter is not,
he might conclude that there is a genuine difference in behavior changes in
experimental and control groups; that the experimental group changed but
the control group did not. Such a conclusion would not be safe. Again,
we do not know whether the two groups actually started on a par, since we
cannot prove the null hypothesis. If the two groups changed in the same
direction, which is a common result where learning is concerned, the fact that
one change is significant and the other not may rest on a very small difference
in the $\bar{z}$ ratio. It is the net difference in change in which we should be
interested. It is the sampling errors in this difference that should determine
our conclusion. None of the comparisons mentioned thus far takes into
account all possible sampling effects.

What we need, then, is a statistical test of the difference between changes. 
The simplest approach is to treat the changes, whether they are mean changes
or individual changes, as if they were single measurements, at least to think
of them as such. There are several ways of estimating the standard error
of the mean net change, depending upon how the two groups were formed.

If we let $D_e$ stand for the mean change for the experimental group
($D_e = M_{e2} - M_{e1}$) and let $D_c$ stand for the mean change for the control
group ($D_c = M_{c2} - M_{c1}$), we are testing the significance of the difference
$D_e - D_c$. If the two groups were chosen at random, we apply formula (9.17),
having determined in the usual manner the SE's of $D_e$ and $D_c$. If the two
groups have been matched person to person, it is best to determine paired
change values and apply either formula (9.19) or the alternate procedure
described in connection with Table 9.5.
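
For the case of two groups chosen at random, the test can be sketched as below. The sketch assumes that formula (9.17) is the usual uncorrelated form, sqrt(SE1^2 + SE2^2); the mean changes and their SE's are hypothetical, and the function name is ours.

```python
import math

def z_difference_between_changes(d_e, d_c, se_d_e, se_d_c):
    """SE and ratio for D_e - D_c when the two groups are independent
    (uncorrelated form assumed for formula 9.17)."""
    se_diff = math.sqrt(se_d_e ** 2 + se_d_c ** 2)
    return se_diff, (d_e - d_c) / se_diff

# Hypothetical mean changes and their SE's.
se_diff, z = z_difference_between_changes(d_e=4.0, d_c=1.5,
                                          se_d_e=0.9, se_d_c=1.2)
print(round(se_diff, 2), round(z, 2))   # 1.5 1.67
```

Here the experimental group's change would be significant taken alone (4.0 / 0.9) and the control group's would not (1.5 / 1.2), yet the net difference in change falls short of the .05 level, which is the point made in the preceding paragraphs.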


Exercises 
1. Compute the standard errors of the means for Data 9A, and interpret your results. 
Determine confidence limits at .05 and .01 levels. State the confidence intervals. 


Data 9A. RESULTS FROM A TEST OF THE ABILITY TO NAME FACIAL EXPRESSIONS
IN THE RUCKMICK PHOTOGRAPHS


2. Compute the standard errors of the means for Data 9B, and interpret your results. 


Data 9B. QUANTITY WRITTEN IN SENTENCE CONSTRUCTION FROM 10 SETS OF
THREE NOUNS EACH AND 10 SETS OF THREE VERBS EACH
Measurement Is the Number of Sentences Written in a Limited Time.
Subjects Were 55 Girls.


Statistic | Nouns Verbs


$r_{nv}$ = .67


3. Compute the standard errors of the medians for Data 9A, and interpret your results.

4. Compute the standard errors of the standard deviations in Data 9A and interpret
your results.

5. Compute the standard errors of the frequencies of passing students in Data 9C, and 
interpret your results. Do the same in terms of percentages and proportions. 
Data 9C. NUMBER OF STUDENTS IN TWO GROUPS WHO PASSED EACH OF THREE
ITEMS IN AN INTRODUCTORY PSYCHOLOGY EXAMINATION


              Group I    Group II

N                37         65
Item A           24         26
                                    $r_{AB}$ = .19
Item B           33         32
                                    $r_{BC}$ = .32
Item C           30         44
                                    $r_{AC}$ = .25


6. The correlation between an interest score and the degree of satisfaction in a certain
vocational assignment was .43 in a sample of 101. Find $\sigma_r$, $\sigma_{r_0}$, the confidence limits at
the .05 and .01 levels, and the $\bar{z}$ ratio (when the population r is assumed to be zero).
Interpret your results.

7. Transform the r of Exercise 6 into Fisher's z, compute the SE of z, determine the
confidence limits at the .05 and .01 levels, and transform the limits to corresponding
r values. Compare these limits with those found in Exercise 6 and explain any differences.

8. Estimate the SE of the difference in means for Data 9A, also for Data 9B, and make $\bar{z}$
tests. Interpret your findings.

9. Using the SE's found in Exercise 8, determine the confidence limits and confidence
intervals at the .05 and .01 levels for the differences between means.

10. Determine the significance of the difference between SD's in Data 9A. State your
conclusions.

11. Determine the significance of differences between groups I and II in Data 9C for the
three items, in terms of proportions of correct answers. Interpret your results.

12. Determine the significance of differences between frequencies passing items A, B,
and C for group II in Data 9C. Interpret your results.

13. Assume that Data 9A are in a stratified-random sample. Compute the SE of the
mean for a combined sample on the basis of this assumption. The SD of a composite of 
the two distributions is 3.38. Compute the SE of the mean also from this information. 
Compare the two SE’s and account for the direction of the difference. 

14. Assume that the same 55 girls of Data 9B repeated very similar tests with the follow- 
ing means: 27.1 and 23.5, for nouns and verbs, respectively. The two SD’s on the second 
occasion were 5.12 and 5.04, respectively. The corresponding reliability coefficients 
(test-retest) were .87 and .75. The intercorrelation between the two tests on the second 
occasion was .60. Compute the following statistics and interpret your results: 

a. The SE's of the means on the second testing. 

b. The SE's of changes in scores in the nouns and verbs tests, with $\bar{z}$ ratios. (Do not
take the reliability coefficients into account more than once.)

c. The SE of the difference between the two tests on the second occasion, with a $\bar{z}$
ratio.

d. The SE and the $\bar{z}$ ratio for the difference in mean changes in the two tests (assuming
the correlation between changes to be zero).


Answers 


1. $\sigma_M$: .373; .247. Limits: .05: 20.4, 21.8; 21.5, 22.5; .01: 20.1, 22.1; 21.4, 22.6.
2. $\sigma_M$: .859; .738.
3. $\sigma_{Mdn}$: .47; .31.
4. $\sigma_\sigma$: .26; .17.
5. $\sigma_f$: group I: 2.90, 1.89, 2.38; group II: 3.95, 4.03, 3.77.
   $\sigma_p$: group I: .077, .051, .064; group II: .060, .062, .058.
   ($\sigma_P = 100\sigma_p$).
6. $\sigma_r$ = .082. Limits: .05 level: .23, .63; .01 level: .17, .69; $\sigma_{r_0}$ = .10; $\bar{z}$ = 4.30.
7. z = .46; $\sigma_z$ = .102. Limits: .05 level: .25, .58; .01 level: .20, .62.
8. Data 9A: $\sigma_{d_M}$ = .448; $\bar{z}$ = 2.01.
   Data 9B: $\sigma_{d_M}$ = .926; $\bar{z}$ = 2.05.
9. Data 9A: limits, .05: 0.02, 1.78; limits, .01: -0.26, 2.06.
   Data 9B: limits, .05: 0.09, 3.71; limits, .01: -0.49, 4.29.
10. $\sigma_{d_\sigma}$ = .315; $\bar{z}$ = 1.49.
11. $\sigma_{d_p}$: .098; .080; .086.
    $\bar{z}$: 2.54; 5.00; 1.56.
12. $\sigma_{d_f}$: AB: 5.37; AC: 5.11; BC: 5.06.
    $\bar{z}$: 1.12; 3.52; 2.37.
13. $\sigma_M$ (stratified) = .209; $\sigma_M$ (composite) = .210.
14. a. $\sigma_M$: .251; .343.
    b. $\sigma_{d_M}$ (nouns) = .838; $\sigma_{d_M}$ (verbs) = .797.
       $\bar{z}$ (nouns) = 2.86; $\bar{z}$ (verbs) = 0.88.
    c. $\sigma_{d_M}$ = .359; $\bar{z}$ = 10.03.
    d. $\sigma_{d_D}$ = 1.156; $\bar{z}$ = 1.47.
CHAPTER 10


TESTING HYPOTHESES 


We have already seen that experiment and statistical method go hand in 
hand. The one supplements the other. The experiment directs our observa- 
tions and yields data, which are usually expressed in terms of numbers. By 
means of statistical methods we can summarize those data, determine their 
reliability and significance, and draw inferences and conclusions. 

Some experiments are designed very simply to answer questions such as, 
“If I do this, what will happen?” Such experiments are exploratory. The
end result is usually in the form of hypotheses, which need further investiga- 
tion. A higher type of experiment is one that sets out to test the truth or 
falsity of some hypothesis. From previous experience, derived from an 
experiment or not, we suspect that a certain relationship exists, but it requires 
a crucial test to enable us to accept or reject the hypothesis. If the crucial 
experiment comes out one way, the hypothesis is probably correct; if it comes 
out another way, the hypothesis is probably wrong. 

A decision as to whether the experiment came out one way or the other or 
whether the result is inconclusive may rest heavily on a statistical inference, 
as we saw in the preceding chapter. A difference between means is positive 
or negative; but could this outcome be one of the chance deviations from no 
difference at all? The conclusion regarding a fact about nature rests upon a 
decision in the form of a statistical inference. 

In this chapter we shall attempt to find more mathematical meaning in the 
idea of statistical inference. By broadening our conception of it we can 
increase its usefulness. We shall see how the tests of significance that were 
applied to large samples in the preceding chapter can also be applied to small 
samples, with certain modifications. 


PROBABILITY MODELS IN STATISTICS 


The Role of Mathematics in Science. Concerning the great value of 
mathematics in science there can be no argument, if we view the development 
of science as a whole, culminating in modern theoretical physics. Whether 
or not we believe that the universe, including man and his behavior, is con- 
structed along mathematical lines, the application of mathematical ideas and 
forms in describing it is an undeniably profitable practice. 

According to one popular view, mathematics (and this includes statistics) 
is an invention of man rather than a discovery. It exists entirely in the realm
of ideas. It is a logic-tight system of elements and relationships, all of which
are univocally defined. It is a completely logical language that can be 
applied to the description of nature because events and objects of nature have 
properties that parallel mathematical ideas, at least to a sufficient degree. 
If the description of nature in mathematical terms is never completely exact, 
there is enough agreement between the forms of nature and the forms of 
mathematical expression to make the description acceptable. The approximation
is often so close that once we have applied the mathematical descrip-
tion we can follow where mathematical logic leads and come out with deduc- 
tions that also apply to nature. 

Take, for example, the normal distribution curve. This is a mathematical 
idea, purely and simply. It is incorrect to refer to it as either a biological or 
a psychological curve. It is a particular mathematical model that happens 
to describe groups of natural objects so well that we can often use the proper- 
ties of the normal curve to make inferences and predictions about those 
objects or groups, as we have been doing in many of the preceding chapters. 
We need now to become more highly conscious of this truth in preparation 
for what follows. We shall meet with some other statistical models and we 
shall put them to work also. 

Statistical Model for a Null Hypothesis. In the preceding chapter we had 
incidental references to null hypotheses. Here we shall see a number of other 
applications of them. We very properly say “null hypotheses,” in the plural, 
for there are many ways of stating a null hypothesis, depending upon the 
experimental problem. In very general terms, this kind of hypothesis merely 
states that in an experimental situation, or even in a nonexperimental situa- 
tion, whenever things are enumerated or measured it is assumed that nothing 
but the laws of chance are operating in a free and unrestricted manner. An 
illustration from experiments on extrasensory perception (ESP) is very 
suitable. 

Suppose that an experiment with the Duke University ESP cards is prop- 
erly designed to prevent the receiver from being influenced by any cues except 
possible telepathic stimulation. There are five different symbols on the 
cards, and in a thoroughly shuffled deck they should come up at random. 
As each one comes up and an experimenter reads it silently, the receiver 
makes his judgment. The card is returned to the deck, which is reshuffled, 
and the next one to be transmitted is selected. 

Starting with the hypothesis that there are no factors (including ESP) at 
work to determine the receiver's responses, we should expect in the long run 
an average success of 20 per cent right, or 1 in 5. If any receiver gives an 
excess of correct responses over and above 20 per cent, we still have to deter- 
mine whether this excess is significant or whether it could have occurred by 
the processes of sampling in his limited number of trials. If the excess is one
that could have happened as much as once in 10 times (one sample of this size
out of 10 such samples), we should still say that the null hypothesis is quite 
plausible. We could not say that it is established; but we would by no means 
give it up. Even if the excess over 20 per cent were one that could happen 
less than once in 20 samples, though we should be more skeptical of the null 
hypothesis, we should be unjustified in completely rejecting it. When so 
large a discrepancy as we obtained could occur by sampling less than once in 
100 times, we customarily reject the hypothesis. We then say that it is 
highly implausible. 

But note that this does not automatically lead us to conclude that the 
alternative (ESP) hypothesis is true. It does tell us that something other 
than guesswork is going on, but it does not tell us what that “something other 
than guesswork” really is. If our experiment is designed so as to exclude all 
other possible factors than ESP in this case, then, having reduced the crucial 
experiment to an either-or proposition, i.e., either laws of chance or ESP, and 
having overwhelming indication that the chance hypothesis is wrong, we can 
accept the ESP hypothesis as true. 

Unfortunately, the identification and control of all other factors favoring 
correct responses here are exceedingly difficult. But, in general, the estab- 
lishment of an experimental fact depends upon them. We shall see shortly 
how a statistical test of the null hypothesis can be made for this type of 
experiment; but first let us consider some simpler cases. 

Expression of a Null Hypothesis in Terms of Probabilities. Our first example
is a simple psychophysical test situation. A student asserts that he can
distinguish between two tones whose stimuli differ by only 2 cycles per second.
That is his hypothesis: that he possesses genuine power to discriminate this
difference in pitch. We doubt him, thus automatically adopting a null
hypothesis. Out of six trials, how many pairs should we require him to judge
correctly before we give up our hypothesis and yield to his? Our hypothesis
implies that when he judges the pair of tones he might just as well flip a coin
and report “second higher” for “heads” and “second lower” for “tails.”
We should expect him, by such guessing, to be correct half the time, or three
times out of six. But how much of an excess over three correct judgments
will it take to convince us that he is not merely guessing?

In a set of six trials, there are seven possible outcomes, all the way from
six down to zero correct judgments. In Table 10.1 are listed all the seven
possibilities and the probability of each event's occurring by random sampling 
(chance). According to the probabilities involved in the situation, we should 
expect only one “score” of 6 in 64 samples; we should expect six "scores" of 
5, 15 “scores” of 4, and so on. These expectations are according to the laws 
of probability. 
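
The entries of Table 10.1 can be reproduced by counting. The sketch below assumes, as the null hypothesis does, that each judgment is a fair two-way guess, so that all 64 sets of six judgments are equally likely; the variable names are ours.

```python
from math import comb

n = 6
sets = 2 ** n                      # 64 equally likely sets of six guesses
rows = []
for score in range(n, -1, -1):
    times = comb(n, score)                                    # times expected in 64 sets
    or_more = sum(comb(n, k) for k in range(score, n + 1))    # as many or more
    or_fewer = sum(comb(n, k) for k in range(0, score + 1))   # as few or fewer
    rows.append((score, times, or_more, or_fewer))
    print(f"{score}  {times:2d}  {times}/{sets}  {or_more}/{sets}  {or_fewer}/{sets}")
```

Each printed row gives the score, the expected frequency in 64 sets, and the three probabilities in the order of the table's columns.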

A Binomial Distribution as a Statistical Model. The distribution of fre- 
quencies or of probabilities in Table 10.1 is called a binomial distribution. 
TABLE 10.1. EXPECTED OCCURRENCES AND PROBABILITIES OF SPECIFIED NUMBERS
OF CORRECT JUDGMENTS IN MAKING SIX JUDGMENTS AT RANDOM


Number of    Times expected    Probability of     Probability of    Probability of
correct      in 64 sets        occurring in       as many or        as few or
judgments    of judgments      random sampling    more occurring    less occurring

    6              1               1/64               1/64              64/64
    5              6               6/64               7/64              63/64
    4             15              15/64              22/64              57/64
    3             20              20/64              42/64              42/64
    2             15              15/64              57/64              22/64
    1              6               6/64              63/64               7/64
    0              1               1/64              64/64               1/64


The reason for this will be explained in a moment. In the preceding chapter 
we used the normal distribution exclusively as the model in making tests of 
the null hypothesis. While the binomial distribution resembles the normal 
one in form, they are mathematically not the same. As the number of 
“coins” is increased, the binomial distribution approaches the normal form 
more and more closely. But note that the binomial distribution is composed 
of discrete “scores.” The probabilities change by jumps rather than by 
gradual transitions, as in the normal distribution. There are many situa- 
tions, of which our psychophysical-judgment problem is one, in which the 
binomial distribution serves best as the model for chance events. Another 
situation would be a rat in a two-choice discrimination-learning experiment 
or a student in answering a limited number of multiple-choice examination 
items. 
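
The approach of the binomial to the normal form can be illustrated numerically. The sketch below compares, for a fair coin, the binomial probability of the most likely score with the height of a normal curve having the same mean (n/2) and SD (sqrt(n)/2); the choice of the modal score as the point of comparison is ours, made only for illustration.

```python
from math import comb, exp, pi, sqrt

def binomial_pmf(k, n):
    """Probability of k heads in n tosses of a fair coin."""
    return comb(n, k) / 2 ** n

def normal_density(x, n):
    """Normal curve with the same mean (n/2) and SD (sqrt(n)/2)."""
    mean, sd = n / 2, sqrt(n) / 2
    return exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * sqrt(2 * pi))

def relative_error_at_mode(n):
    """Relative gap between normal height and binomial probability at n // 2."""
    k = n // 2
    return abs(normal_density(k, n) - binomial_pmf(k, n)) / binomial_pmf(k, n)

for n in (6, 20, 100):
    print(n, round(relative_error_at_mode(n), 4))
```

The relative discrepancy shrinks as the number of coins increases, although the binomial scores remain discrete however large n becomes.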

The Binomial Expansion. A mathematical way of deriving the probabilities
for the seven scores is to apply the expansion of the binomial
$(1/2 + 1/2)^6$. In tossing a coin there are two possible, independent outcomes,
head or tail. The theoretical probability of a head occurring is 1/2,
and the probability of a tail is also 1/2. The general expression for the
binomial is $(p + q)^n$, where n is the number of coins tossed. Heads and tails
exhaust the possible outcomes for the mathematician's coin, so that p + q = 1.
Now 1 to any power equals 1, so that the binomial $(p + q)^n$ = 1.0.

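The generalized binomial expansion is


$$(p + q)^n = p^n + \frac{n}{1} p^{n-1} q + \frac{n(n-1)}{1 \times 2} p^{n-2} q^2 + \frac{n(n-1)(n-2)}{1 \times 2 \times 3} p^{n-3} q^3 + \frac{n(n-1)(n-2)(n-3)}{1 \times 2 \times 3 \times 4} p^{n-4} q^4 + \cdots + q^n \tag{10.1}$$


in which p and q can have any positive values so long as p + q = 1.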
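Applied to the problem with six coins (n = 6),

$$\left(\tfrac{1}{2} + \tfrac{1}{2}\right)^6 = \left(\tfrac{1}{2}\right)^6 + 6\left(\tfrac{1}{2}\right)^5\left(\tfrac{1}{2}\right) + \frac{6 \times 5}{1 \times 2}\left(\tfrac{1}{2}\right)^4\left(\tfrac{1}{2}\right)^2 + \frac{6 \times 5 \times 4}{1 \times 2 \times 3}\left(\tfrac{1}{2}\right)^3\left(\tfrac{1}{2}\right)^3 + \frac{6 \times 5 \times 4 \times 3}{1 \times 2 \times 3 \times 4}\left(\tfrac{1}{2}\right)^2\left(\tfrac{1}{2}\right)^4 + \frac{6 \times 5 \times 4 \times 3 \times 2}{1 \times 2 \times 3 \times 4 \times 5}\left(\tfrac{1}{2}\right)\left(\tfrac{1}{2}\right)^5 + \left(\tfrac{1}{2}\right)^6$$

$$= \frac{1}{64} + \frac{6}{64} + \frac{15}{64} + \frac{20}{64} + \frac{15}{64} + \frac{6}{64} + \frac{1}{64}$$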
If the seven fractions are summed, the result is equal to 1. The probabilities
coincide with those in Table 10.1 for the various scores. The numerators
give the expected frequencies of the scores 0 to 6 inclusive, when the total
number of scores is 64.
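
The expansion can be carried out term by term with exact fractions. The sketch below is one way of doing so; the function name is ours, and the general term comb(n, r) * p^(n-r) * q^r is the standard binomial coefficient form of the terms written out above.

```python
from fractions import Fraction
from math import comb

def binomial_terms(p, n):
    """Terms of (p + q)^n with q = 1 - p, ordered from p^n down to q^n."""
    q = 1 - p
    return [comb(n, r) * p ** (n - r) * q ** r for r in range(n + 1)]

terms = binomial_terms(Fraction(1, 2), 6)
print([int(t * 64) for t in terms])   # numerators over 64: [1, 6, 15, 20, 15, 6, 1]
print(sum(terms))                     # 1
```

The same function serves when p and q are unequal, for example p = 1/5 for a guess among five symbols; the terms still sum to 1.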


GENERAL PROBLEMS OF HYPOTHESIS TESTING


Although we did much about testing hypotheses in the preceding chapter, 
we did so at a rather superficial level. We shall now go more deeply into the 
matter, for deeper understanding of the problems and the principles involved 
is necessary before considering a greater variety of applications. Some 
things mentioned in Chap. 9 will be essentially repeated, but they will bear 
repeating. There are many qualifications to be made to things already 
presented. 

Testing Deviations from Expected Values. In determining whether the 
student’s hypothesis about his acuity for pitch discrimination has much claim 
for acceptance, we are interested in how far his obtained score deviates from 
that to be expected by chance. The most probable chance score in this 
situation would be three correct judgments out of six. How much deviation 
from a score of 3 does he need in order to lead us to reject the null hypothesis? 

A score of 6 would be expected one-sixty-fourth of the time. One chance 
in 64 would seem to lie between the .05 and .01 levels that are commonly 
applied as standards. If this were a one-tail test, we should reject the 
hypothesis at this level of significance if the obtained score is 6 (in six trials). 
We shall have to digress a bit to consider the logic behind a one-tail versus a 
two-tail test in this situation. 

One-tail versus Two-tail Tests. If we begin the experiment with the 
general position that either this is a pure chance situation or it is not, we have 
a two-tail proposition on our hands. Of the not-chance alternative there are 
two possible outcomes: an extreme positive deviation or an extreme negative
deviation. Either outcome falls into a single logical region, in spite of the 
fact that the two occupy opposite tails in a distribution. If this is the logic 
with which we start, a deviation provided by a score of 0 is just as probable
and just as significant as a score of 6. The confidence level attached to the
occurrence of either score, 0 or 6, would be 2/64, or 1/32. The two-tail test
thus also leads to a rejection of the null hypothesis here beyond the .05 level,
but not as far beyond the .05 level as in the case of the one-tail test.

In this psychophysical problem, an obtained score of 0 would be interesting 
to interpret. If we had adopted the .05 level of significance, what does a 
score of 0 mean? It surely does not mean indication of ability to make cor- 
rect auditory judgments. It might be argued that a score of 0 is even more 
positive evidence of lack of ability than a score of 3 would be. From a 
statistical standpoint, a deviation represented by a score of 0 is just as signifi- 
cant as a score of 6. But where a score of 6 would be taken to indicate 
ability in the positive direction, a score of 0 should be taken to indicate a bias 
of some other kind, toward wrong discrimination. The source of the bias 
would have to be determined from other information or from results of 
another experiment. 

If we adopt the one-tail test here, scores below 3 would be regarded differ- 
ently. We have less interest in them, for one thing. Since the difference of 
opinion with which the experiment started involved just two alternatives— 
the student can sense a difference or he cannot sense a difference—a one-tail 
test seems more logical than a two-tail test. We can well argue that if he 
surpasses our adopted confidence limit we will accept his view. Any other
score, then, whether it is in the insignificant range of positive deviations or 
whether it is a negative deviation, of any size, has a similar meaning. It fails 
to indicate support for his hypothesis. The alternative hypothesis, then, is 
supported if the result comes out in a region that includes all scores 0 through 
5. The student's hypothesis is supported if the result comes out in the 
region that includes a score of 6 only. In the two-tail test in this problem, 
the region of acceptance of the hypothesis of a nonchance event includes 
scores of 0 and 6. The region of rejection of the nonchance hypothesis 
includes scores 1 through 5. 
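
The two kinds of regions can be found mechanically once a level has been adopted. The following sketch assumes fair two-way guessing and a symmetrical two-tail region; the function names are ours.

```python
from fractions import Fraction
from math import comb

def upper_tail(score, n):
    """Chance probability of `score` or more correct in n fair guesses."""
    return Fraction(sum(comb(n, k) for k in range(score, n + 1)), 2 ** n)

def rejection_region(n, alpha, two_tail):
    """Scores leading to rejection of the chance hypothesis at level alpha."""
    region = []
    for score in range(n // 2 + 1, n + 1):
        p = upper_tail(score, n)
        if two_tail:
            p = 2 * p                     # equal tail added from the other side
        if p <= alpha:
            region.append(score)
            if two_tail:
                region.append(n - score)  # mirror-image score in the lower tail
    return sorted(region)

alpha = Fraction(5, 100)
print(rejection_region(6, alpha, two_tail=False))   # [6]
print(rejection_region(6, alpha, two_tail=True))    # [0, 6]
```

With six trials and the .05 standard, only a perfect score supports the student in the one-tail test, and only scores of 0 and 6 indicate a nonchance event in the two-tail test.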

Combining Probabilities in Significant Regions. We concluded that a 
score of 6 would be regarded as significant between the .05 and .01 levels, 
whether we apply a one-tail or a two-tail test. Let us ask whether a score of 
5 would be significant, in either case. 

The test to be made is not whether a score of precisely 5 is significant, even 
though it makes sense to state the probability of obtaining a score of exactly 
5, neither higher nor lower. The policy here is to follow the practice of asking
whether a certain amount of deviation is rare enough to be rejected as a chance 
event. In the illustrative problem, this means asking whether a score of 5 
or higher could have happened by chance. The region for rejection of the 
null hypothesis then includes probabilities for scores of 5 or 6. In terms of 
probability, this region is defined as a combination (simple sum) of the two 
probabilities for scores 5 and 6. This sum gives us the probability of 7/64. 
In a one-tail test, a score as large as 5, with a probability of 7/64 (about .11), would
fall short of significance at the .05 point.¹ For a two-tail test we would combine the two tail probabilities of
7/64 each, giving a probability of 14/64, which is a little less than 1/4. 
Combining the tails of a distribution is another example in which two proba- 
bilities are summed. We have been doing this before without making the 
principle explicit. In general, when we ask what is the probability of either 
event A or event B happening, it is the simple sum of the probability of the 
happening of A plus the probability of the happening of B. 
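
The addition of probabilities for a score of 5 can be checked directly. This is a minimal sketch under the same assumption of fair two-way guessing in six trials; the variable names are ours.

```python
from fractions import Fraction
from math import comb

n = 6
p5 = Fraction(comb(n, 5), 2 ** n)     # exactly 5 correct: 6/64
p6 = Fraction(comb(n, 6), 2 ** n)     # exactly 6 correct: 1/64
one_tail = p5 + p6                    # 5 or 6 correct
two_tail = 2 * one_tail               # adds scores of 0 and 1 as well
print(one_tail, float(one_tail))      # 7/64 0.109375
print(two_tail, float(two_tail))      # 7/32 0.21875
```

Both values exceed .05, so a score of 5 in six trials would not lead to rejection of the null hypothesis under either test.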

Another Example on the Binomial Model. We consider next a case with
a larger number of trials: a set of 10 true-false items to which a student gives
one of two alternative responses, one right and one wrong. How many more
than five items must he do correctly for us to reject the hypothesis that he is
merely guessing at random? The probabilities corresponding to the four
highest scores are given in Table 10.2. These probabilities are derived from
the application of the binomial $(1/2 + 1/2)^{10}$.


TABLE 10.2. EXPECTED OCCURRENCES AND PROBABILITIES OF SPECIFIED NUMBERS
OF CORRECT RESPONSES TO 10 TRUE-FALSE TEST ITEMS


Number of    Expected        Probability    Probability       Probability of     Probability of
correct      number right    of this        of this number    a like deviation   a like deviation
responses    in 1,024 sets   number         or higher         in opposite        in either
                                                              direction          direction

   10              1           1/1,024         1/1,024           1/1,024            1/512
    9             10          10/1,024        11/1,024          11/1,024           11/512
    8             45          45/1,024        56/1,024          56/1,024            7/64
    7            120         120/1,024       176/1,024         176/1,024           11/32


From Table 10.2 we see that a score as extreme as 10 could occur only once 
in 512 attempts. A score of 10 almost certainly indicates some knowledge or 
ability measured by the test. A score as extreme as 9 could occur 11 times 
in 512 attempts, or about 1 in 46 attempts. This would indicate probable 
knowledge or ability but not with great assurance. A score as extreme as 8 
could occur about once in 9 attempts and is consequently not at all fatal to 
the null hypothesis. For a one-tail test, which is more defensible here, we 
would halve these probabilities.²
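
The figures of Table 10.2 can be reproduced in the same manner as those of Table 10.1. The sketch assumes purely random guessing on each of the 10 items; the variable names are ours.

```python
from fractions import Fraction
from math import comb

n = 10
sets = 2 ** n                                  # 1,024 equally likely answer patterns
either_direction = {}
for score in (10, 9, 8, 7):
    upper = sum(comb(n, k) for k in range(score, n + 1))   # this number or higher
    either_direction[score] = Fraction(2 * upper, sets)    # both tails combined
    print(score, comb(n, score), upper, either_direction[score])
```

The last value on each printed line is the two-tail probability; halving it gives the one-tail probability referred to above.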


1 As a technical discrimination in terminology, we speak of the .05 and .01 points in
connection with a one-tail test and the .05 and .01 levels in connection with a two-tail test, 
when referring to a deviation. In either case, we speak of significance levels, which are 
specified in terms of probabilities. 

2 For a discussion of the problem of testing whether “runs” of the same response are of 
sufficient length to justify rejection of the null hypothesis, see Grant, D. A. New statistical
criteria for learning and problem solution in experiments involving repeated trials. Psychol. 
Bull., 1946, 43, 272-282. 
Departures from Random Conditions. In applying such tests of the null
hypothesis to any practical situation such as this, however, it must be kept in 
mind that we are assuming that in the event of complete ignorance the 
examinee will guess purely at random. Experience tends to show that in the 
absence of knowledge human beings do not always guess or respond at ran- 
dom. They exhibit patterns of responses or pattern habits. With biases 
such as this in the picture, hypotheses based upon chance distributions must 
be made with great caution and sometimes are precluded. The presence of 
bias cannot be easily detected, but one evidence of it would be a “significant” 
deviation in an unreasonable direction, as when in a guessing situation a 
statistically significant number of wrong judgments or responses occurs. 
Goodfellow has shown, in connection with “experiments” on telepathy over
the radio, for example, that when an audience made five successive guesses of
“black” versus “white” there were a number of common sequence patterns.¹
Alternations occur less frequently than one would expect by chance; runs are 
avoided; and certain initial responses may be favored, sometimes in response 
to an incidental cue that an experimenter might well overlook. 

The presence of such nonrandom effects is bothersome, but there are 
experimental controls that may help to prevent them. There is probably 
enough randomness under a wide range of behavior to make possible a very 
profitable use of the statistical tests that depend upon it. 

A More General Conception of Hypothesis Testing. From the preceding 
discussions it can be seen that a sampling distribution provides not only a 
model for what would happen in a chance situation but also a division of that 
distribution into regions of rejection and acceptance of alternative hypotheses, 
once we have decided upon a one-tail versus a two-tail test and have adopted 
a confidence level. We shall now put these ideas in more general form. 

Let us say that the hypotheses with which we are concerned have to do 
with a sex difference in verbal-comprehension ability. There are logically 
three possible alternatives: males have more ability than females; females 
have more ability than males; and there is no sex difference in ability. 
Expressed in terms of symbols, these three alternatives may be stated: 
$M_m > M_f$, $M_m < M_f$, or $M_m = M_f$, where the M's stand for means and the 
subscripts obviously for male and female. 

An investigator who approaches this problem with an open mind simply 
asks, “Is there a genuine sex difference here?” He would make a two-tail 
test. He would combine the first two alternatives into the form $M_m \neq M_f$. 
His two alternatives would be this hypothesis against the hypothesis $M_m = M_f$. 
He is prepared to find and to accept a significant deviation in either direction, 
for either satisfies the hypothesis $M_m \neq M_f$. He also accepts the algebraic 
sign of his obtained difference as being meaningful, since, if he were to mark 


1 Goodfellow, L. D. The human element in probability. J. gen. Psychol., 1940, 24, 
201-205. 
off confidence limits for the difference, the confidence interval would be 
entirely or predominantly on one side of the point of zero difference. Using 
the information provided by the algebraic sign, he could not only reject the 
null hypothesis but make a decision between the two alternatives included in 
the hypothesis $M_m \neq M_f$. 

An investigator who has some strong hunch in favor of either the hypothesis 
$M_m > M_f$ or the hypothesis $M_m < M_f$, either from the logic of the situation 
or from previous experience, or both, would make a one-tail test. If he 
believes that females are superior to males in verbal-comprehension ability, 
he would reduce the situation to two alternatives, as usual, by combining the 
other two alternatives. Thus, it would be the hypothesis $M_m < M_f$ versus 
the hypothesis $M_m \geq M_f$, where the latter means that the mean of the males 
is equal to or greater than that of the females. 

The reduction of three alternatives to two is a simplifying step so far as 
decision is concerned. The two alternatives are sometimes indicated by the 
symbols $H_0$ and $H_1$. $H_0$ represents the hypothesis that certain defined 
chance events are operating in the sampling situation, and $H_1$ represents the 
hypothesis that something other than chance events is operating. In the 
two-tail test, $H_0$ is naturally called a null hypothesis, since it is accepted when 
deviations are relatively close to a zero point. In a one-tail test, the $H_0$ 
hypothesis may be accepted even when there are large deviations. Under 
the latter circumstances, it does not seem proper to refer to $H_0$ as a null 
hypothesis. The conception of $H_0$ is therefore broader than that of the 
null hypothesis, thus opening up possibilities of other types of hypothesis 
testing. 

Hypothesis Testing with the Normal-curve Model. In the previous 
illustrations, we actually counted up the total number of possible outcomes 
and also the number of times certain outcomes would be expected, and from 
these we obtained directly the probabilities of such outcomes if the null hypothesis 
were true. There are other instances, when the number of responses we 
deal with is quite limited, in which a similar counting of cases can be done 
and the probability of extreme deviations from chance can be derived. 
When the number of possible outcomes is not small, however, this counting 
of cases, or even algebraic computations of permutations and combinations, 
is much less efficient than other methods that will be described next. 

In a certain elementary-psychology laboratory experiment, we have the 
problem to determine whether students can perceive from photographs 
whether or not a man has been convicted of crime. Pictures of 20 pairs 
of men matched for certain qualities are exhibited, and the student judges 
which of the two is the criminal. The null hypothesis calls for 10 correct 
responses, provided that only random guessing accounted for the score. 
How large an excess is indicative of actual perception or of something 
other than chance? 
To solve this problem, we do not resort to counting up the probabilities 
of as many as 20, 19, 18, etc., or more correct responses. Rather, we assume 
that each set of 20 judgments is a sample and that such samples would have 
a mean of 10, and a standard error of this mean will be the SE of a frequency, 
which equals $\sqrt{Npq}$ [see formula (9.11)]. We also assume a normal 
distribution of the samples of frequencies.1 For this problem, N is 20, p is .5, 
and q is .5. The $\sigma_f$ is therefore $\sqrt{20 \times .5 \times .5} = 2.236$. The distribution 
of these frequencies is shown in Fig. 10.1, with a mean of 10 and a $\sigma$ of 2.236. 
We are now ready to ask about the probability of a randomly determined 
score being as high as X or higher. For example, would a score of 14 be 
significantly in excess of the expected score of 10? 


[Figure 10.1: normal curve of frequencies over the score scale, with the points 10, 13.5, 
14.5, and 15.5 marked; the last three lie 1.57, 2.01, and 2.46 $\sigma$ above the mean.] 

Fig. 10.1. Standard-score distance from the hypothetical mean of the integral scores 14, 15, 
and 16 (correct judgments out of 20) when each judgment has an even chance of being 
right or wrong on the hypothesis of complete ignorance. 


Correction for Discontinuity. At first thought, a score of 14 seems to 
mean a deviation of four units above the mean of the distribution. But 
remember that a score of 14 on a continuous scale actually occupies the 
interval from 13.5 to 14.5. A score of “14 or above” in this case therefore 
takes in all the normal curve above the point 13.5. It is a different matter 
to ask what is the area under the normal curve for a score of 14 or higher 
and to ask what is the area above the point 14.0. We need what is called a 
correction for discontinuity, because we are substituting the normal dis- 
tribution, which is on a continuous scale, for the binomial distribution, 
which is on a discrete, or discontinuous, scale. 

The deviation of the lower limit of a score of 14 from the mean in this 
problem is 3.5 units. Dividing this deviation by $\sigma$, which is 2.236, we have 
a z equal to 1.57. Going to the normal-curve table (Table B) with this z 
value, we find the proportion of the area above the point 13.5 to be .0583. 
We do not reject the hypothesis of no ability when the score is no higher 
than 13. 


1 Remember that the sampling distribution is actually binomial here. We may use the 
normal distribution as our model only because the approximation is sufficiently close. 
Even so, we have to make a minor correction, which will be explained. 
A score of 15, which begins at 14.5, is $2.01\sigma$ above 10, and the probability 
of a chance score this high or higher is .0223. Such a deviation is significant 
between the .05 and .01 points. 

A score of 16 is $2.46\sigma$ above the mean and has only about 7 chances in 
1,000 of occurring by guesswork. If all secondary cues, i.e., cues not having 
to do with objective signs of criminality versus noncriminality in the photo- 
graphs, were eliminated, we could conclude that the student who earns a 
score of 16 probably has some ability to make this kind of discrimination. 
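
The arithmetic of the last few paragraphs can be checked with a short sketch. It is only an illustration: the function names are ours, and the exact binomial tail is added for comparison, which the text does not compute.

```python
from math import comb, sqrt
from statistics import NormalDist


def normal_tail(score, n, p):
    """Normal-curve estimate of P(X >= score), measured from the score's lower limit."""
    mean = n * p
    sigma = sqrt(n * p * (1 - p))      # SE of a frequency, sqrt(Npq)
    z = (score - 0.5 - mean) / sigma   # correction for discontinuity
    return z, 1 - NormalDist().cdf(z)


def binomial_tail(score, n, p):
    """Exact P(X >= score), found by counting cases in the binomial distribution."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(score, n + 1))


# 20 paired photographs, even chance of a correct judgment on each
for score in (14, 15, 16):
    z, approx = normal_tail(score, 20, 0.5)
    exact = binomial_tail(score, 20, 0.5)
    print(f"score {score}: z = {z:.2f}, normal tail = {approx:.4f}, exact tail = {exact:.4f}")
```

The z values round to 1.57, 2.01, and 2.46, as in the text. The tail areas differ slightly from the tabled ones because the table is entered with z rounded to two places.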

How Large a Deviation Is Significant? To return to the ESP problem, 
in 50 trials, when the probability of chance success is .20 and so the expected 
frequency is 10, the standard error of the frequency is 


$$\sqrt{50 \times .2 \times .8} = 2.83$$


We could now test the plausibility of the null hypothesis in the face of different 
numbers of correct responses in excess of 10. But it might be more 
pertinent to ask how large a score it would take to be significant at the 
usual levels.1 

To be significantly in excess of 10 (in a one-tail test) a score of X or larger 
could happen by chance only 5 per cent of the time. What point on the 
score scale comes at such a position? From the table, the z corresponding 
to this point is 1.64. This value times $\sigma$ is $1.64 \times 2.83$ units on the score 
scale. This excess added to 10 is 14.6. Remembering that a score of 15 
begins at 14.5, we conclude that a score of at least 15 is required to be sig- 
nificant beyond the .05 level. For the .01 level of confidence, the z value is 
2.33, and in terms of score units the excess is 6.6. This gives a score value 
of 16.6. In terms of whole numbers, a score of 17 or better is required for 
significance at the .01 level. 
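
As a rough check on these figures, the critical point on the score scale can be computed directly. This is a sketch under the same normal-curve assumption; the function name is ours, and it uses unrounded z values rather than the tabled 1.64 and 2.33.

```python
from math import sqrt
from statistics import NormalDist


def critical_point(n, p, alpha):
    """Point on the score scale that cuts off the upper alpha tail (one-tail test)."""
    mean = n * p                           # expected frequency under chance
    sigma = sqrt(n * p * (1 - p))          # SE of a frequency, sqrt(Npq)
    z = NormalDist().inv_cdf(1 - alpha)    # about 1.645 for .05, 2.326 for .01
    return mean + z * sigma


# ESP problem: 50 trials, chance success .20
for alpha in (0.05, 0.01):
    print(f"alpha = {alpha}: critical point = {critical_point(50, 0.2, alpha):.2f}")
```

The two points fall within a few hundredths of the 14.6 and 16.6 found above; the small differences come only from rounding z.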

How Large a Sample Is Necessary for Significant Deviations from Null 
Hypotheses? We have already raised and answered the kind of question 
that asks, for a given size of sample, how large a discrepancy is necessary 
for significant and very significant deviation from a null hypothesis. Here 
we face a little different kind of question. We let our relative excess remain 
constant and ask how large N must be in order for that same size of discrepancy 
to reach the critical levels. 

In a survey like the Gallup poll, for example, one would constantly be 
faced with the question of how large a sample to obtain, how many inter- 
views to make, how many responses to a stimulus to record. That mere 
numbers in a sample, as such, are not sufficient to guarantee predictive 
ability was brought home to us decisively by the unhappy Literary Digest 


1 The sampling distribution here is also binomial and could be generated by the expansion 
of the expression $(1/5 + 4/5)^{50}$. When $p \neq .5$ the distribution is skewed, but we may 
substitute the normal-distribution model, since the product Np is as large as 10. This 
meets a criterion adopted in Chap. 9. 
poll of 1936. Though the votes sampled ran into the millions, the voters 
who really determined the outcome of the presidential election were not 
adequately represented in the sample. A good poll sees to it that every 
kind of group of voters where group differences count at all is proportionately 
represented in the poll. When this is accomplished, it is surprising 
to the uninformed person how small a total sample can yield a valid 
predictive index. In other words, it is not so much enormous numbers that 
count as how the sample is composed. 

Let us assume that our sample is properly composed, with good representation.1 
Let us assume an issue where majority vote is decisive. Our 
null hypothesis is then 50 per cent, or a proportion equal to .50. We ask 
first how large a sample is needed to give us confidence that an obtained 
vote of 55 per cent in favor of the proposition means a majority sentiment 
in that direction and did not occur by random sampling from a population 
that is on the fence. If a discrepancy of as much as 5 per cent is to be 
significant in our accepted meaning of the word, 5 per cent must deviate 
as much as $1.96\sigma$ from the mean of a normal distribution.2 In terms of 
proportions, the deviation is .05; how large must $\sigma_p$ be? Obviously it 
must be such that .05 is 1.96 times $\sigma_p$. $\sigma_p$ is therefore equal to .05/1.96, 
which equals .0255. The formula we need is 


$$N = \frac{pq}{\sigma_p^2} \qquad \text{(Size of sample needed for significant deviation)} \tag{10.2}$$

We know p and q and $\sigma_p$ already. Substituting them in the equation, we 
have 

$$N = \frac{.5 \times .5}{.0255^2} = \frac{.25}{.00065025} \approx 384$$


to the nearest whole number. It is therefore a 19 to 1 bet that when a 
vote comes out with 55 per cent in favor of an issue in a sample of 384, the 
population sampled is not evenly divided on the question. 

But where much is at stake, we should not be satisfied with these odds 
against the null hypothesis. We might ask how many votes need to be 
sampled to assure us of a very significant deviation. In this case, the excess 
of .05 must be at $2.576\sigma$ from the mean. The $\sigma_p$ must be .05/2.576, which 
equals .0194. Applying formula (10.2) to determine N, we have 


1 For the case of stratified sampling that is usually applied in public-opinion polling, 
modifications in line with standard-error formulas that fit that situation should be applied 
(see Chap. 9) rather than the general one for completely random sampling that is illustrated 
here. 

2 A two-tail test is indicated here, since the proportion favorable is as likely to go below 
.50 as above .50. 
$$N = \frac{.5 \times .5}{.0194^2} = \frac{.25}{.00037636} \approx 664$$


Thus, in a sample of 664 interviewees, a majority vote of 55 per cent would 
be regarded as very significant. The odds would be 99 to 1 that the senti- 
ment of the population sampled is not evenly divided on the issue. And 
since the deviation is in the direction favoring the issue, we strongly expect 
future outcomes to be in the same direction, but we do not know by how 
much. Setting up confidence limits would be somewhat informing. 

The sizes of samples just found are surprisingly small in view of the 
enormous populations that vote on national issues and whose sentiment 
they may be expected to estimate. The reason is that we have allowed a 
rather wide margin of .05 as the deviation from null hypothesis. In deal- 
ing with more vital issues, where close elections are concerned, excesses of 
.01 or less may be decisive. If we are interested in the sizes of sample 
required to give significant and very significant indications when the vote 
is .51 to .49, the SE of the proportion must be one-fifth as large as it was 
for a .55 to .45 division. If $\sigma_p$ is one-fifth as large, $\sigma_p^2$ is one twenty-fifth 
as large. In this particular problem, the numbers to be substituted in 
formula (10.2) are now the same except that the denominator is one twenty-fifth 
of its former size. This makes N twenty-five times as large as before. 

For a deviation of .01 to be significant now, N must be 9,600 and to 
be very significant it must be 16,600, these numbers being 25 times 384 
and 664, respectively. Samples of this size would give us great assurance, 
granting random sampling, that the sentiment is in the direction indi- 
cated. On many issues, of course, the sentiment is more unevenly balanced 
than .55 and .45. And, again, when we are interested in significance of 
changes in sentiment, we have a revision of our problem, for then we are 
dealing with differences among proportions. 
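
The sample sizes quoted in this section follow from formula (10.2) once the required standard error is written as the deviation divided by z. A minimal sketch, assuming completely random sampling as the text does (the function name is ours):

```python
def sample_size(p, deviation, z):
    """N from formula (10.2), N = pq / sigma_p**2, where sigma_p = deviation / z."""
    sigma_p = deviation / z
    return p * (1 - p) / sigma_p**2


# Null hypothesis p = .50; two-tail z of 1.96 (.05 level) and 2.576 (.01 level)
for deviation in (0.05, 0.01):
    for z in (1.96, 2.576):
        n = sample_size(0.5, deviation, z)
        print(f"deviation {deviation}, z = {z}: N is about {round(n)}")
```

For a deviation of .05 this gives 384 and 664. For a deviation of .01 the unrounded results are a little different from 9,600 and 16,600, which the text obtains by multiplying the rounded 384 and 664 by 25.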

Significance Levels and Errors of Statistical Inference. Thus far we 
have not considered very seriously the question of what significance levels 
to adopt. Since we have control over this act, we need some rules and 
logical defenses for the standards of significance we use. 

Some investigators adopt a standard of significance in advance of the 
study or experiment. This lays down the rule for decision-making before- 
hand and makes it easy when the time comes. One disadvantage is that 
there may be temptation to modify the adopted standard after the results 
are in. Other investigators prefer not to adopt any rigid standard of acceptance 
or rejection of hypotheses. They are content to observe the level of 
significance reached and to report this fact. There is no need to follow either 
school of thought consistently. 

Two Kinds of Errors in Statistical Inferences. The choice of a standard 
of significance depends very much on the risk we take of being wrong in 
making a statistical inference. Two statistical errors are possible in this 
connection: 

Type I: rejecting hypothesis $H_0$ when it is true 

Type II: accepting hypothesis $H_0$ when it is false 

$H_0$ is commonly a null hypothesis. 

The probability standard adopted for rejecting a hypothesis is sometimes 
called $\alpha$. This is invariably a relatively small value. Common values 
for a two-tail test are .10, .05, .02, .01, and even sometimes .005. The corresponding 
values of $\alpha$ in a one-tail test are .05, .025, .01, .005, and .0025, 
respectively. These probabilities not only represent a scale of significance 
but also tell us the chances we take of being wrong. Thus, the smaller $\alpha$ 
is, the less risk we take of being wrong when we reject a null hypothesis; 
the less risk we take of making an error of type I. 

But note that, as $\alpha$ decreases, we also increase the chances of an error of 
type II. As $\alpha$ increases, we increase the chances of an error of type I but 
decrease the chances of an error of type II. 
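
The trade-off between the two kinds of error can be made concrete with a small sketch. It rests on assumptions that are ours and not the text's: a one-tail z test, and a particular true deviation of 2 standard errors under which $H_0$ is false. The risk of a type II error is then the area of the shifted normal curve that falls short of the critical point.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def type_ii_risk(alpha, shift):
    """P(accept H0 | H0 false) in a one-tail z test.

    `shift` is the assumed true deviation, in standard-error units.
    """
    z_crit = STD_NORMAL.inv_cdf(1 - alpha)   # point that sets off the rejection region
    return STD_NORMAL.cdf(z_crit - shift)    # chance of falling short of that point


# Hypothetical true deviation of 2 SE; alpha is the type I risk we choose
for alpha in (0.10, 0.05, 0.01, 0.005):
    print(f"alpha = {alpha:<5}  type II risk = {type_ii_risk(alpha, shift=2.0):.3f}")
```

Whatever shift is assumed, the printed risks rise as alpha is made smaller, which is the relation described above.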

The crux of the dilemma is how much we want to weight errors of the 
two kinds. The cautious scientist abhors more the error of type I. He 
wants to be rather sure that his finding is not due to chance. That is why $\alpha$ 
is generally so small. And yet, caution can be overdone, resulting in the 
situation that few nonchance conclusions are drawn and few differences and 
relationships are accepted as established. 

Some kind of balance must be reached. Considerations external to the 
data themselves should be given weight. There may be serious theoretical 
or practical reasons why it would be costly to make one kind of error or the 
other. Thus, the odds, ultimately, cannot be decided on statistical grounds. 
Once the nonstatistical issues have been evaluated, however, the statistical 
standards can be more easily adopted. 

In research on important theoretical issues, such as whether or not telep- 
athy and clairvoyance exist, or whether there is inheritance of acquired traits, 
a higher-than-usual level of confidence (lower $\alpha$) may well be demanded. 
The potential social impact of conclusions on these questions justifies this 
practice. If it is a question of the selection of the best of several insecticides 
when one is sorely needed and none does any harm, a larger $\alpha$ might well 
be tolerated. If it is the use of a new anesthetic, in a concentration needed 
for effectiveness, and if too much threatens death to the human patient, 
a much smaller $\alpha$ might be demanded. 

In general research practice, where the externally determined risks are 
of little consequence, we may follow a suggestion made by McNemar.1 
He proposes that instead of confining ourselves to a two-choice decision, 
rejection or acceptance of hypothesis $H_0$, we allow a third alternative, 
that of suspended judgment. That is, we might (1) accept hypothesis $H_1$ 

1 McNemar, Q. Psychological Statistics. New York: Wiley, 1955. P. 70. 
when the deviation is significant at the .01 level or better; (2) accept hypothesis 
$H_0$ when the deviation falls short of the .10 level; and (3) suspend judgment 
when the result comes between those two limits. 
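
The three-way rule can be written out as a tiny function. This is only a sketch of the rule as stated; the function name and the treatment of results falling exactly on a limit are our assumptions.

```python
def three_way_decision(p_value):
    """Three-way decision: accept H1, accept H0, or suspend judgment."""
    if p_value <= 0.01:        # significant at the .01 level or better
        return "accept H1"
    if p_value > 0.10:         # falls short of the .10 level
        return "accept H0"
    return "suspend judgment"  # between the two limits


for p in (0.004, 0.03, 0.25):
    print(p, "->", three_way_decision(p))
```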

The Power of a Statistical Test. The power of a statistical test has to do 
with its ability to reject a null hypothesis when the deviation is of a certain 
size. The subject is a complicated one, which we cannot go into here.1 
There are some important implications for practice to be drawn from it, 
of which we shall take advantage. 

Comparing the alphas as between one-tail and two-tail tests, we can see 
that the former are more powerful than the latter. The same deviation 
has a better chance of being significant in a one-tail test than in a two-tail 
test; hence there is a better chance of rejecting hypothesis Ho. 

In connection with the t ratio, power is increased by any procedure that 
decreases the size of the standard error, whether it is the SE of a mean, a 
correlation coefficient, or a difference between such statistics. A SE is 
made smaller in several ways, as indicated in Chap. 9: by increasing the 
size of sample, by stratified or matched sampling, and by experimental 
controls of other kinds. All these procedures help to detect a difference 
when it is real and hence to avoid generally errors of type II. 
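
Both implications can be illustrated numerically. The sketch below uses a large-sample z test of a single mean rather than the t ratio, and the true difference, standard deviation, and sample sizes are hypothetical values chosen for illustration. The one-tail figure also assumes that the direction of the difference was predicted correctly.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(true_diff, sd, n, alpha=0.05, tails=2):
    """Approximate chance of rejecting H0 when the true difference is `true_diff`."""
    se = sd / sqrt(n)                              # SE of a mean shrinks as N grows
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / tails)
    shift = abs(true_diff) / se                    # true difference in SE units
    upper = 1 - STD_NORMAL.cdf(z_crit - shift)     # rejection in the predicted tail
    if tails == 1:
        return upper
    return upper + STD_NORMAL.cdf(-z_crit - shift) # plus the (tiny) opposite tail


# Hypothetical case: true difference of 2 points, SD of 10
for n in (50, 100, 200):
    print(f"N = {n:3d}: one-tail power = {power(2, 10, n, tails=1):.3f}, "
          f"two-tail power = {power(2, 10, n, tails=2):.3f}")
```

At each N the one-tail test shows the greater power, and for either test power rises as N increases and the standard error shrinks.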


SMALL-SAMPLE STATISTICS 


The distinction between large-sample and small-sample statistics is not 
an absolute one, by any means, the one realm merging into and overlapping 
so extensively the other. If one asks, "How small is N before we have a 
small sample?" the answers from different sources will vary. There is 
general agreement that the division, if there must be one, is in the range 
of 25 to 30. Some place it as low as 20 and others say that anything under 
100 is a small sample. The truth of the matter is that the needs for small-sample 
considerations increase as N decreases and they may become critical 
somewhere below an N of 30. Sampling distributions depart from the 
normal form more and more as N decreases. This was first realized by 
W. S. Gosset, who published for many years under the mysterious name 
of "Student," and it was later emphasized by R. A. Fisher, who has worked 
out many of the small-sample procedures. 

The Sampling Distribution of t. For small samples, many statistics 
exhibit sampling distributions that depart from normality in various ways. 
Distributions of correlation coefficients, proportions, and of standard devia- 
tions are often skewed. Another important change that affects distributions 
of differences, also, is a change in kurtosis. Kurtosis is apparent in the 
degree of peakedness of the center of the distribution. A normal dis- 
tribution is called mesokurtic, which means neither very peaked nor very 


1 For very complete discussions of this subject see Walker, H. M., and Lev, J. Statisti- 
cal Inference. New York: Holt, 1953. 
flat across the top. Curves tending toward rectangular form, more or less, 
are called platykurtic. Those more peaked than normal are called leptokurtic 
(see Fig. 10.2). 

Many of the small-sample statistical tests are based upon the statistic 
known as Student's t. Actually, t is defined as we have defined z. It is 
the ratio of a deviation from the mean or other parameter, in a distribution of 
sample statistics, to the standard error of that distribution. Either in the 
case of z or t, we have a sampling distribution. Imagine that we computed 
the ratio for every single sample drawn from the same population with 
N constant. A frequency distribution of these ratios would be a t (or z) 
distribution. 

The difference between z and t is one of degree of generality. Statistic z 
is normally distributed and is so interpreted. It applies when samples are 
large and sometimes under other restrictions, as when derived from samples 


Fig. 10.2. Comparison of a normal distribution with a leptokurtic distribution when their 
means and standard deviations are approximately equal. 


of p or r. Statistic t, on the other hand, applies regardless of the size of 
sample. Where the sampling distribution of z is restricted to one degree of 
kurtosis, the sampling distribution of t may vary in kurtosis. Student's t 
distribution becomes increasingly leptokurtic as the number of degrees of 
freedom decreases (see Fig. 10.3). As the df becomes very large, the distribution 
of t approaches the normal distribution. 

Figure 10.3 shows t distributions with differing df involved. The most 
important feature of a leptokurtic distribution, as compared with the normal 
distribution, for the purposes of hypothesis testing, is the difference at the 
tails. The tails are higher for the leptokurtic distribution. The effect of 
this is that we have to go out to greater deviations in order to find the points 
that set off the regions significant at the .05, .01, and other standard levels. 
From Fig. 10.3 it can be seen that the t distribution for 25 df comes very 
close to the normal distribution, but the one for 9 df definitely does not. 
Before we decide to accept a normal-curve approximation to the t distribution 
when there are 25 degrees of freedom, however, let us consider what difference 
it would make in the significance limits. 
Significance Limits in the t Distribution. In the distribution of t, significant 
t values have been determined at the .05 and .01 levels. These are listed 
in the last column of Table D (Appendix B). Reference to that table will 
show that when the df is infinite those two t values are 1.960 and 2.576, the 
same as for the normal distribution. With 1,000 df the critical values are 
different from those figures only in the third decimal place. For 100 df 
there is a little change in the second decimal place. The limits with 100 df 
are 1.984 and 2.626. Rough limits, by rounding, of 2.0 and 2.7 would do 
very well even down to about 30 df. With only 10 df, however, t's of 2.23 
and 3.17 would be required for the .05 and .01 significance levels. With 
small samples, then, it becomes imperative to consider the changing t values 


[Figure 10.3: frequency curves plotted against the scale of t.] 

Fig. 10.3. Student's sampling distribution of t for various degrees of freedom. As the df 
becomes infinite, the distribution of t becomes normal. (After D. Lewis. Quantitative 
Methods in Psychology. Iowa City: Published by the author, 1948.) 


needed for significance. Even when the df is greater than 30, if t turns 
out to be near the critical limits it would be well to refer to Table D to find 
the exact values. 
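
Where Table D is not at hand, the significance limits can be approximated numerically. The sketch below is one way of doing so, not the method by which the table was prepared: it integrates the density of Student's t by Simpson's rule and locates the limit by bisection. The function names, the step count, and the search bounds are our choices.

```python
from math import exp, lgamma, log, pi


def t_density(t, df):
    """Ordinate of Student's t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + t * t / df))


def two_tail_area(t, df, steps=1000):
    """Area beyond +t and -t; the central area is found by Simpson's rule."""
    h = t / steps
    total = t_density(0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    return 1 - 2 * (total * h / 3)


def critical_t(df, alpha):
    """t whose two-tail area equals alpha, located by bisection."""
    low, high = 0.0, 100.0
    for _ in range(40):
        mid = (low + high) / 2
        if two_tail_area(mid, df) > alpha:
            low = mid    # too much area in the tails: move outward
        else:
            high = mid
    return (low + high) / 2


for df in (10, 25, 100, 1000):
    print(f"df = {df:4d}: .05 limit = {critical_t(df, 0.05):.3f}, "
          f".01 limit = {critical_t(df, 0.01):.3f}")
```

The limits for 10 and 100 df agree with those quoted above (2.23 and 3.17; 1.984 and 2.626), and those for 1,000 df differ from 1.960 and 2.576 only in the third decimal place.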

Fisher's t Formulas. Fisher has provided several formulas designed for 
the computation of t. We shall first note his formula for use in connection 
with a coefficient of correlation. 

The t Test of a Coefficient of Correlation. In testing the null hypothesis 
for a coefficient of correlation, the required t is estimated by the formula 


$$t = r\sqrt{\frac{N - 2}{1 - r^2}} \qquad \text{(The $t$ ratio for testing the significance of a coefficient of correlation)} \tag{10.3}$$

where r = obtained coefficient of correlation and N = number of pairs of 
observations. 
Applying this formula to a problem in which r = .30 and N = 50, 

$$t = .30\sqrt{\frac{48}{.91}} = 2.18$$
The hypothesis that the population correlation is zero can be rejected just 
beyond the .05 level. According to Table D, with the 48 df we have here, 
the two required t's are 2.01 and 2.68 for the .05 and .01 levels. 
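
Formula (10.3) is simple enough to check in a line or two. A minimal sketch (the function name is ours):

```python
from math import sqrt


def t_for_r(r, n):
    """t ratio for testing a correlation coefficient against zero, formula (10.3)."""
    return r * sqrt((n - 2) / (1 - r * r))


t = t_for_r(0.30, 50)
print(f"t = {t:.2f} with {50 - 2} df")   # compare with 2.01 and 2.68 from Table D
```

For r = .30 and N = 50 this reproduces the t of 2.18 found above.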

The t Test of a Difference between Means. When means are uncorrelated, 
the t formula for testing their difference is 

$$t = \frac{M_1 - M_2}{\sqrt{\left(\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_1 + N_2 - 2}\right)\left(\dfrac{N_1 + N_2}{N_1 N_2}\right)}} \qquad \text{(Fisher's $t$ for testing a difference between uncorrelated means)} \tag{10.4}$$

where $M_1$ and $M_2$ = means of the two samples 
$\Sigma x_1^2$ and $\Sigma x_2^2$ = sums of squares in the two samples 
$N_1$ and $N_2$ = numbers of cases in the two samples 

The complete numerator should read $M_1 - M_2 - 0$, to indicate that it 
represents a deviation of a difference from the mean of the differences. The 
denominator as a whole is the SE of the difference between means, as the t 
ratio requires. 

In writing the SE of the difference in this form, we take the null hypothesis quite seriously. 
That is, if there is but one population, there should be but one estimate of 
the population variance. In the first term under the radical we have combined 
the sums of squares from the two samples (in the numerator) and the 
degrees of freedom (in the denominator) that come from the two samples. 
The expression $N_1 + N_2 - 2 = (N_1 - 1) + (N_2 - 1)$. The effect of the 
second expression under the radical is to give us the SE of the mean difference. 

When two samples are of equal size, i.e., $N_1 = N_2$, formula (10.4) simplifies 
to 

$$t = \frac{M_1 - M_2}{\sqrt{\dfrac{\Sigma x_1^2 + \Sigma x_2^2}{N_i(N_i - 1)}}} \qquad \text{($t$ for difference between uncorrelated means in two samples of equal size)} \tag{10.5}$$

where $N_i$ = size of either sample. 
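
A short sketch may help to show that formula (10.5) is only a special case of formula (10.4). The two small samples are hypothetical, invented for the illustration, and the function names are ours.

```python
from math import sqrt


def sum_of_squares(sample):
    """Sum of squared deviations from the sample mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


def t_uncorrelated(sample_1, sample_2):
    """Formula (10.4): one pooled estimate of the population variance."""
    n1, n2 = len(sample_1), len(sample_2)
    m1, m2 = sum(sample_1) / n1, sum(sample_2) / n2
    pooled = (sum_of_squares(sample_1) + sum_of_squares(sample_2)) / (n1 + n2 - 2)
    se_diff = sqrt(pooled * (n1 + n2) / (n1 * n2))
    return (m1 - m2) / se_diff, n1 + n2 - 2      # t and its degrees of freedom


def t_equal_sizes(sample_1, sample_2):
    """Formula (10.5): the same ratio when both samples have N_i cases."""
    n = len(sample_1)
    m1, m2 = sum(sample_1) / n, sum(sample_2) / n
    total_ss = sum_of_squares(sample_1) + sum_of_squares(sample_2)
    return (m1 - m2) / sqrt(total_ss / (n * (n - 1)))


group_1 = [12, 15, 11, 14, 13]   # hypothetical scores
group_2 = [10, 9, 12, 8, 11]     # hypothetical scores
t, df = t_uncorrelated(group_1, group_2)
print(f"formula (10.4): t = {t:.2f}, df = {df}")
print(f"formula (10.5): t = {t_equal_sizes(group_1, group_2):.2f}")
```

With these data both formulas give a t of 3.00 with 8 df.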
When means of paired samples are not independent but correlated, the 
best formula to use for deriving t directly from sums of squares is 

$$t = \frac{M_d}{\sqrt{\dfrac{\Sigma x_d^2}{N(N - 1)}}} \qquad \text{($t$ for difference between correlated pairs of means)} \tag{10.6}$$

where $M_d$ = mean of the N differences of paired observations and $x_d$ = 
deviation of a difference from the mean of the differences. 

The procedure implied by this formula was actually applied in connection 
with the knee-jerk data under two experimental conditions (see Table 9.5). 
The number of df to use with t in this case is $N - 1$, where N is the number 
of pairs of observations. For the knee-jerk problem N = 26 and there are 
25 df, which indicates (from Table D) that t's of 2.06 and 2.79 are significant 
at the customary levels. The obtained t was 3.06. 
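
The computation implied by formula (10.6) can be sketched as follows. The paired scores are hypothetical (they are not the knee-jerk data of Table 9.5), and the function name is ours.

```python
from math import sqrt


def t_correlated(first, second):
    """Formula (10.6): t from the differences of paired observations."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = sum(diffs) / n                              # M_d
    ss_diff = sum((d - mean_diff) ** 2 for d in diffs)      # sum of x_d squared
    return mean_diff / sqrt(ss_diff / (n * (n - 1))), n - 1 # t and its df


condition_1 = [10, 12, 9, 11, 13]    # hypothetical measures, first condition
condition_2 = [12, 13, 12, 13, 15]   # same individuals, second condition
t, df = t_correlated(condition_1, condition_2)
print(f"t = {t:.2f} with {df} df")
```

Note that the df is one less than the number of pairs, not the number of single observations.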

When t Tests Do Not Apply. If there is good reason to believe that the 
population distribution is not normal but is seriously skewed, and especially 
if the samples are small, the t tests do not apply. For skewed distributions 
Festinger, and others, have suggested substitute tests.1 There are also 
available a number of distribution-free tests, some of which are described 
in the next chapter. 

The reader should also be warned that the t test is invalid unless the two 
samples can be regarded as arising from the same population, so that the 
variances are homogeneous (differences are insignificant). The homogeneity 
of the two variances can be tested by making an F test described later in this 
chapter. Cochran and Cox have provided a method for meeting the case of 
unequal variances.2 

One should also have some hesitation in using these t formulas if the N's 
in the two samples differ markedly. Differing N's do not seem to affect 
similarly the use of formulas (9.17) and (9.19). 

Test of a Difference between Uncorrelated Proportions. When the null 
hypothesis is assumed with regard to two observed proportions, Fisher 
recommends, again, that we use just one estimate of the population variance. 
This requires the use of a weighted mean of the two sample proportions. 
Formula (4.10), previously given, can be employed here. Applied to this 
particular use, the formula reads 


$$\bar{p}_e = \frac{N_1 p_1 + N_2 p_2}{N_1 + N_2} \qquad \text{(Weighted mean of two samples to estimate a population proportion)} \tag{10.7}$$


The test of significance of a difference between proportions here is not 
particularly a small-sample matter. The formula to be given could have 
been stated in connection with large-sample tests in Chap. 9. Fisher's 
formula is given here instead because it follows the principle of using a single 
estimate of population variance, consistent with the small-sample statistics. 
So long as the samples are of sufficient size to justify application of the 
standard-error formulas for proportions at all, we assume normal sampling 
distributions, not t distributions. The formulas are: 

$$z = \frac{p_1 - p_2}{\sqrt{\bar{p}_e \bar{q}_e \left(\dfrac{N_1 + N_2}{N_1 N_2}\right)}} \qquad \text{(A $z$ for a difference between uncorrelated proportions)} \tag{10.8}$$

where $\bar{q}_e = 1 - \bar{p}_e$. 
1 Festinger, L. The significance of difference between means without reference to the 
frequency distribution function. Psychometrika, 1946, 11, 97-105. 

2 Cochran, W. G., and Cox, G. M. Experimental Designs. New York: Wiley, 1950. 
P. 92. 
When the two samples are of equal size, i.e., $N_1 = N_2$, 

$$z = \frac{p_1 - p_2}{\sqrt{\dfrac{2\bar{p}_e \bar{q}_e}{N_i}}} \tag{10.9}$$

where $N_i$ = size of either sample. 

The sampling distribution of z obtained from these formulas is said to 
approach the normal form closely enough for purposes of interpretation, 
provided that the smallest product of $\bar{p}_e$ or $\bar{q}_e$ times $N_1$ or $N_2$ is not less than 
10. If such a product is between 5 and 10, a correction for discontinuity 
can be made by reducing the value of the numerator in absolute size (whether 
it is positive or negative) by the extent of the value 

$$\frac{1}{2}\left(\frac{1}{N_1} + \frac{1}{N_2}\right) = \frac{N_1 + N_2}{2 N_1 N_2}$$

If the smallest pN or qN product is less than 5, we can still possibly resort 
to the use of a chi-square test, which is described in Chap. 11. 
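
The steps of formulas (10.7) and (10.8), with the optional correction for discontinuity, can be sketched as follows. The proportions and sample sizes are hypothetical, and the function name and its `correct` switch are ours.

```python
from math import sqrt


def z_uncorrelated_proportions(p1, n1, p2, n2, correct=False):
    """z for a difference between two independent proportions, formula (10.8)."""
    p_e = (n1 * p1 + n2 * p2) / (n1 + n2)     # weighted mean, formula (10.7)
    q_e = 1 - p_e
    diff = p1 - p2
    if correct:
        # reduce the numerator in absolute size by (N1 + N2) / (2 N1 N2)
        reduction = (n1 + n2) / (2 * n1 * n2)
        shrunk = max(abs(diff) - reduction, 0.0)
        diff = shrunk if diff >= 0 else -shrunk
    return diff / sqrt(p_e * q_e * (n1 + n2) / (n1 * n2))


# Hypothetical case: .60 of 20 versus .30 of 20; the smallest product is 9,
# which lies between 5 and 10, so the correction would be applied
print(f"uncorrected z = {z_uncorrelated_proportions(0.60, 20, 0.30, 20):.2f}")
print(f"corrected z   = {z_uncorrelated_proportions(0.60, 20, 0.30, 20, correct=True):.2f}")
```

The correction always brings z nearer to zero, so it can only make the test more conservative.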

Differences between Correlated Proportions. While formula (9.20) is 
sufficiently general to take care of differences between correlated proportions, 
a more economical way was proposed by McNemar.1 The formula avoids 
the necessity for computing the standard errors of the proportions as well 
as the correlation coefficient. 

For a genuine nonzero correlation to exist between the two samples, as 
usual, either the same individuals or objects must appear in both or there 
must be a pairing in some significant manner, as of twins, siblings, or experi- 
mental-control pairs. 

Suppose that we have administered two test items to a sample of 100 
students. Item I is answered correctly by 60 of the group and item II by 70. 
Is item II actually easier than item I? In making the z test to answer this 
question, we must definitely face the possibility of correlation between the 
two items and consequently between the two proportions. To handle this 
problem properly, we need to arrange the data in the form of a four-cell 
contingency table, as in Table 10.3. 

TABLE 10.3. A FOUR-CELL CONTINGENCY TABLE OF FREQUENCIES OF STUDENTS 
WHO PASSED OR FAILED EACH OF TWO TEST ITEMS 

                Frequency Table              Symbolic Table
                    Item II                      Item II
Item I       Fail    Pass    Both         Fail    Pass    Both
Pass            5      55      60           b       a     a + b
Fail           25      15      40           d       c     c + d
Both           30      70     100         b + d   a + c     N

1 McNemar, Q. Note on the sampling error of the difference between correlated 
proportions or percentages. Psychometrika, 1947, 12, 153-157. 

At the left are the four frequencies 
of those who pass item I and either pass or fail item II, and the frequencies 
of those who fail item I and either pass or fail item II. At the right in 
Table 10.3 are given letter symbols to stand for the four categories. Using 
these symbols, the formula reads 


$$z = \frac{b - c}{\sqrt{b + c}} \qquad \text{(A $z$ ratio for difference between correlated proportions)} \qquad (10.10)$$

(See Table 10.3 for definition of symbols.) 

It will help to ensure the proper application of this formula to note that 
the symbols b and c stand for the discordant cases in the four-cell table. 
In this problem, b and c stand for individuals who pass one item and fail 
the other. It will help to know that the difference $b - c$ divided by N 
equals the difference between $p_1$ and $p_2$. The numerator is therefore the 
difference between two frequencies, i.e., $b - c = Np_1 - Np_2$. To find the difference 
that is being tested in the numerator of the z ratio is not a new experience. 
The denominator must therefore somehow represent the SE of a difference 
between frequencies with the correlation between them taken into account. 
In this formula, too, there is implied but one estimate of the population 
variance, and it is derived from an average of the sample proportions. 
What we are actually testing with formula (10.10) is whether the change in 
frequencies is significant. 

Solving formula (10.10) as applied to the test-item data, we have 

$$z = \frac{5 - 15}{\sqrt{5 + 15}} = \frac{-10}{\sqrt{20}} = -2.24$$

We would infer the difference to be significant between the .05 and .01 levels. 
Item II is probably easier than item I. 
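
As a rough sketch, formula (10.10) needs nothing more than the two discordant frequencies. The function name below is hypothetical. The cell values b = 5 and c = 15 for the test-item example are assumed here; they are the values implied by the marginal totals of 60 and 70 passes together with the reported z of 2.24. The second call uses the vote-switching data of Exercise 9.

```python
import math

def mcnemar_z(b, c):
    """z ratio of formula (10.10), computed from the two discordant frequencies."""
    return (b - c) / math.sqrt(b + c)

z_items = mcnemar_z(5, 15)    # test-item example (assumed cell values)
z_voters = mcnemar_z(20, 10)  # Exercise 9: 20 switched one way, 10 the other
print(round(z_items, 2), round(z_voters, 2))  # -2.24 1.83
```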
It is instructive to see what the outcome would have been if we had applied 
formula (10.9) without taking into account the amount of intercorrelation. 
With $p_e$ estimated to be .65, 

$$z = \frac{-.10}{\sqrt{\dfrac{2(.65)(.35)}{100}}} = -1.48$$

From the latter result we would have concluded that the difference was 
not significant. This demonstrates how a decision may be altered drastically 
when the correlation term in the standard-error formula is taken into account. 
Without it, we run the risk of making an error of the second kind, of not 
rejecting the null hypothesis when it is false. The correlation (φ coefficient) 
between the two items amounts to +.58. The reader will find that if he 
lets $\sigma^2_{p_1} = \sigma^2_{p_2} = .002275$ and substitutes these with the correlation 
of +.58 in formula (9.20), he will come out with a $\sigma_{d_p}$ equal to .0439, which 
gives a z of 2.29, which is near that obtained with McNemar's formula (2.24). 

One restriction in the application of formula (10.10) is that $b + c$ should 
be 10 or greater. 
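
The effect of the correlation term can be checked numerically. The sketch below is illustrative only. It assumes that formula (9.20), which is not reproduced in this section, has the usual form for the standard error of a difference between correlated statistics, i.e., the square root of $\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2$. All variable names are hypothetical.

```python
import math

n = 100
p1, p2 = 0.60, 0.70          # proportions passing items I and II
p_e = (p1 + p2) / 2          # pooled estimate, .65
var_p = p_e * (1 - p_e) / n  # squared standard error of either proportion
r = 0.58                     # phi coefficient reported in the text

# Ignoring the correlation, as in formula (10.9).
se_uncorrelated = math.sqrt(2 * var_p)
# Assumed form of formula (9.20): the correlation term is subtracted.
se_correlated = math.sqrt(2 * var_p - 2 * r * var_p)

z_uncorrelated = abs(p1 - p2) / se_uncorrelated
z_correlated = abs(p1 - p2) / se_correlated
print(round(var_p, 6), round(z_uncorrelated, 2), round(z_correlated, 2))
# 0.002275 1.48 2.29
```

Under these assumptions the ratio rises from about 1.48 to about 2.29 once the correlation is allowed for, which agrees with the figures quoted above.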

The F Test of Differences between Standard Deviations. For small 
samples, the t test of differences between standard deviations is not 
satisfactory, even with the availability of Student's distribution for t. 

Instead of testing the significance of a difference between two σ's, we can 
test the significance of the ratio of the two variances that correspond to them. 
If we compute the ratio of the larger of two variances to the smaller of the 
two, the larger the difference, the further the ratio exceeds 1.00. The 
ratio is 1.00 when the two variances are equal. If the ratio of the variances 
is significant, the difference between the standard deviations is significant. 

More accurately stated, we do not find the ratio of the variances in the 
two samples. Instead, we find an estimate of the population variance 
from each of the two random samples and from these values compute the 
ratio. We assume the null hypothesis, that the two samples came from 
the same population, and we ask whether two estimates of that population 
variance could differ as much as the obtained ratio indicates. The ratio 
has been given the symbol F and is computed from the formula 

$$F = \frac{\text{larger variance}}{\text{smaller variance}} \qquad \text{(F ratio for testing a difference between two estimates of a population } \sigma^2\text{)} \qquad (10.11)$$

Each of these estimated variances is computed by the usual method: sum 
of squares in the sample divided by the number of degrees of freedom. 
This application of the F test rests upon the assumption that the population 
is normally distributed. 

A small set of data will illustrate the operation of this procedure. Assume 
that two sets of scores, in one of which $N_1 = 8$ and in the other of which 
$N_2 = 5$, have sums of squares $\Sigma x_1^2 = 132$ and $\Sigma x_2^2 = 26$. The degrees 
of freedom are 7 and 4, respectively, and so the estimated variances of the 
population, independently derived, are 18.86 and 6.5. The F ratio is 
18.86/6.5, which equals 2.90. 
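
The arithmetic of this example can be reproduced with a short sketch. The helper names are hypothetical; the only rule applied is the one stated above, that each variance estimate is a sum of squares divided by its degrees of freedom, with the larger estimate placed in the numerator.

```python
def variance_estimate(sum_of_squares, n):
    """Estimate of population variance: sum of squares over N - 1."""
    return sum_of_squares / (n - 1)

def f_ratio(ss1, n1, ss2, n2):
    """Return (F, df of numerator, df of denominator), larger variance on top."""
    v1, v2 = variance_estimate(ss1, n1), variance_estimate(ss2, n2)
    if v1 >= v2:
        return v1 / v2, n1 - 1, n2 - 1
    return v2 / v1, n2 - 1, n1 - 1

f, df_num, df_den = f_ratio(132, 8, 26, 5)
print(round(f, 2), df_num, df_den)  # 2.9 7 4
```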

The Distribution of F. In random sampling, the distribution of F ratios 
can be predicted from the mathematical relationships. Figure 10.4 repre- 
sents three distributions for the situations with certain combinations of 
degrees of freedom, all of them being very small samples. Especially to be 
noted is the marked skewness of the curves. 

Table F (Appendix B) gives the standard F limits that are significant at 
the .05 and .01 levels of significance when there are different combinations 
of degrees of freedom in connection with each of the two variances in the 
ratio. For the problem above, the two degrees of freedom are 7 and 4, 
respectively, for the larger and smaller (or numerator and denominator) 
variances. Looking into the appropriate column and row of Table F, we 
find that the two F’s for the two significance levels are 6.09 and 14.98, 
respectively.¹ The obtained F does not even approach the former of these 
very closely. We therefore do not reject the null hypothesis and decide 
that so far as variance or variability is concerned the two samples could 
well have come from the same population. 


FIG. 10.4. Sampling distribution of Snedecor's F for various combinations of degrees of 
freedom. (After D. Lewis. Quantitative Methods in Psychology. Iowa City: Published by 
the author, 1948.) 


In Chap. 12 we shall see the F test extended considerably to the problems 
of analysis of variance. It is in that connection that the F test justifies 
the recognition that it deserves. The application demonstrated here is 
only one of many. 

Sequential Analysis. There has been developed a procedure that enables 
the investigator to save considerable time and effort by testing for significance 
as he samples. Large differences are likely to prove significant with rather 
small samples. It would be wasteful of experimental effort to accumulate 
more cases than would be needed to give a very significant t or F. When we 
have no advance information as to how large a difference is going to be, we 
do not know how large a sample will be needed to ensure significance at some 
prescribed level. We could, of course, obtain a small sample, test the 
difference, and if it proved significant stop the experiment. If it did not 
prove significant, we could continue the experiment, adding observations 
sampled in the same manner, making successive tests. Eventually, the 
test goes in the direction of one hypothesis or another. The principle is 
applied in the method known as sequential analysis. There is insufficient 
space to describe the method adequately here. The reader is referred to an 
original source on the subject.² 

1 In this particular use of F, however, the probabilities must be doubled, i.e., they are 
.10 and .02. The reason is that we arbitrarily placed the larger variance on top. By 
chance the ratio could have been as large with the other variance on top in formula (10.11). 


Exercises 


1. Suppose that we ask an observer to arrange a series of weights in rank order from 
lightest to heaviest, the differences being very small. If he places them in perfect rank 
order, what is the probability that he could have done so by sheer guessing? No matter 
how many weights ranked, there is only one correct way of doing this. The total number of 
ways the observer could have arranged each number of weights is given below: 

Number of weights     3      4      5      6      7
Number of orders      6     24    120    720  5,040

Which perfect orders would be regarded as “not significant,” “significant,” and “very 
significant”? State the probabilities of perfect rank orders by chance. 

2. In a discrimination-learning experiment, a rat has two alternative responses, one of 
which is correct. The correct response is to his right in random sequence. During the 
first 12 trials the rat goes left a total of nine times. Using both the binomial-distribution 
model and the normal-distribution model, determine the probability for a result as extreme 
as this. State your conclusions about the rat. 

3. An observer knows that he will hear one of three speech sounds. He is given the 
three in random order in a total of 30 trials. How many correct judgments must he give 
before we regard his success as significant at the .05 and .01 levels? 

4. A certain examination includes 40 items, each with four alternative responses. How 
large a score must a student make before you feel that he probably knows something about 
the subject? Before you feel that he definitely knows something about the subject? 
Express “probably” and “definitely” in statistical terms. 

5. In a test of five-response items, how many items would you need to include in order 
to have confidence at the .05 and .01 levels that a score of 30 per cent right indicates 
knowledge of the subject? How many items for the same confidence levels that a score of 25 
per cent indicates knowledge of the subject? 

6. Compute t for the following combinations of r and N: 

r    .25   .25   .50   .50
N    25    50    25    50

7. Compute a t for difference between means in the following data: $N_1 = 11$; $N_2 = 26$; 
$M_1 = 17.5$; $M_2 = 14.8$; $\Sigma x_1^2 = 44$; $\Sigma x_2^2 = 65$. The means and variances are uncorrelated. 

8. Compute a t for the following data, testing the difference between proportions: 
$N_1 = 36$; $N_2 = 16$; $p_1 = .25$; $p_2 = .375$. 

9. In a certain district 200 voters cast votes in both the 1948 and 1952 elections. Of 
these, 20 switched votes from the Democratic candidate for president to the Republican 
candidate, whereas 10 switched in the reverse direction. Was this a significant trend? 

10. Apply an F test to the two variances for the data in Exercise 7. Interpret your 
result. Was the application of the t test in Exercise 7 justified? Discuss. 


Answers 


1. In each case, p is the reciprocal of number of orders. 
2. Binomial solution: p = .073 (one-tail test). 
Normal-curve solution: M = 6; σ = 1.73; z = 1.44; p = .076. 
3. M = 10; σ = 2.58. 
Score points significant at .05 and .01 levels: 15.1 and 16.7; approximate integral 
scores required: 16 and 17, respectively. 
4. M = 10; σ = 2.74. 
Score points significant at .05 and .01 levels: 15.4 and 17.1; approximate integral 
scores required: 16 and 18, respectively. 
5. For 30 per cent right: N (.05 level) = 62; N (.01 level) = 107. 
For 25 per cent right: N (.05 level) = 246; N (.01 level) = 425. 
6. t: 1.24; 1.79; 2.77; 4.00. 
7. t = 4.25 ($\sigma_{d_M}$ = .635). 
8. t = 0.92 ($\sigma_{d_p}$ = .136). 
9. z = 1.83. 
10. F = 1.69; p > .05. 

2 Wald, A. Sequential Analysis. New York: Wiley, 1947. 


CHAPTER 11 


CHI-SQUARE AND OTHER STATISTICAL TESTS 


In the two preceding chapters we saw that it is possible to cast in statistical 
language some hypotheses concerning natural events. This makes it possible 
to apply certain rigorous statistical tests to the data and eventually to 
arrive at some conclusions regarding the events under investigation. Fur- 
thermore, the statistical tests give us some indication of how much confidence 
to place in the conclusions. 

Although the statistical tests covered thus far are quite varied and their 
applications are numerous, they do not provide for all our needs. The z 
and t tests lack complete generality. One reason is that they rest upon the 
assumption of normal distribution of measurements in the population. 
Another is that they are limited to the evaluation of one statistic or one 
difference at a time. The use of the binomial distribution allows additional 
latitude in the form of population distribution, but its application is limited 
to instances where N is small. 

In this chapter and the next we shall find a considerably expanded reper- 
toire of statistical tests. In this chapter we shall deal with a group of tools 
that have sometimes been called “distribution-free” statistics. The reason 
for this category title is that they rest on no assumption concerning the form 
of population distribution. Another name for this group is “nonparametric 
statistics," probably for the reason that in their use we are not concerned 
with estimates of population parameters. The most important of these 
statistics is chi square, which will receive the lion's share of our attention. 


GENERAL FEATURES OF CHI SQUARE 


Chi square is a general-purpose statistic that has many and diverse applica- 
tions. Its most common use is in connection with data in the form of 
frequencies, or data that can be reduced to frequencies. This includes 
proportions and even probabilities. One important advantage of chi square 
lies in certain additive properties, which make possible the combination of 
several statistics or other values in the same test. Thus, a hypothesis 
involving more than one set of data at a time can be tested for significance. 

By definition, a chi square is the sum of ratios (any number can be summed). 
Each ratio is that between a squared discrepancy or difference and an 
expected frequency. The discrepancy is between an obtained frequency 
and a frequency expected on the basis of the hypothesis we are testing. 

Chi Square in a Contingency Table. Consider the data in Table 11.1. 
This is called a contingency table because of the two possibly related variables 
(intelligence level and marital status, in this particular problem). Whether 
an individual in the data is married or not may be contingent upon his 
intelligence, or vice versa. We have in the table two samples; one is of 
206 young American males who, when they were in school, had been regarded 
as feebleminded in terms of IQ. Their IQ’s were in the range 60 to 69. 
The other group is of 206 men of similar age (in the twenties) and of IQ's 
near 100.¹ At the time the study was made, the proportions married in the 
two groups were .539 and .408 for the normal and feebleminded groups, 
respectively. Is this difference significant? 

The last question suggests a test of a difference between two proportions. 
This would be one way of testing the difference. Applying a z test, we should 
find that the difference of .131 gives a z of 2.66, which is significant beyond 
the .01 level. 

Another question that we could ask is, “Is there any correlation between 
being married and level of intelligence in this kind of population?” Being 
married or not married and being normal or feebleminded would be two 
genuine dichotomies, calling for the special correlation coefficient known as φ 
(see Chap. 13). The phi coefficient for these data is .13. Is this small 
correlation coefficient significant? Such a question normally involves a t test 
of a coefficient of correlation. But this is not a Pearson product-moment r 
based upon continuous measurements, and so the t test previously seen in 
Chap. 10 will not apply. We can test the significance of phi by making a 
chi-square test. 

Chi Square as a Test of Independence. The null hypothesis for a con- 
tingency table such as Table 11.1 is that there is no correlation; the two 
variables are independent in the population in question. The null hypothesis 
in connection with the z test of these data is that there is a zero difference 
in marriage rates. The two null hypotheses are essentially the same in 
that when there is a zero difference there is also zero correlation. 

In the chi-square test, a null hypothesis can be conceived in still a third 
way. It begins fundamentally in the same general manner; assuming that 
the two samples arose by random sampling from the same population. 
Next comes the question, “If this be true, how likely is it that the distribution 
of cases like those obtained could depart as much as they do from a random, 
or chance, distribution?" The four frequencies in the cells in Table 11.1 
are 111, 84, 95, and 122. There seems to be some systematic tendency for 
concentration of cases in two cells: married-normal and unmarried-feebleminded. 
This looks like a meaningful departure from a random distribution. 

1 Baller, W. R. A study of the present status of adults who were mentally deficient. 
Genet. Psychol. Monogr., 1936, 18, 165-244. 

If the distribution were random, what would it look like? We must deter- 
mine the answer to this question, for that is the distribution called for by the 
null hypothesis. The distribution to be expected is determined entirely 
by the marginal totals, i.e., the sums of rows and columns. We take these 
values to be fixed. 


TABLE 11.1. A COMPARISON OF MEN OF NORMAL IQ WITH FEEBLEMINDED MEN 
WITH RESPECT TO MARITAL STATUS 

Marital status      Normal    Feebleminded    Both
Married               111          84          195
Unmarried              95         122          217
Both                  206         206          412

The proportion of feebleminded versus normal was an arbitrary choice 
of the investigator. He wanted equal groups, hence the two frequencies 
of 206. The other marginal totals indicate the proportion married in the 
two groups combined. Those totals are 195 and 217. Within the limitations 
of these four marginal frequencies there is much room for variation in 
distribution of cases in the four cells. Does the obtained variation deviate 
significantly from the frequencies to be expected from the marginal values? 


TABLE 11.2. THE EXPECTED NUMBERS OF MARRIED AND UNMARRIED MEN IN THE 
NORMAL AND FEEBLEMINDED GROUPS HAD THERE BEEN NO DIFFERENCE 
BETWEEN THE TWO 

Marital status      Normal    Feebleminded    Both
Married               97.5        97.5         195
Unmarried            108.5       108.5         217
Both                 206         206           412


Computation of Chi Square from a Contingency Table. Reference to 
Table 11.1 shows that, of the total sample of 412, the proportion married 
was .4733. By random sampling from the same population, both normal 
and feebleminded should show the same proportion of married individuals. 
This proportion of 206 would lead us to expect 97.5 cases in the married 
category for both groups. We should also expect the remaining 108.5 
individuals to be unmarried in either group. These expected frequencies, 
$f_e$, are shown in Table 11.2. If we add the columns and rows of Table 11.2 
we find them to be identical with those in Table 11.1. Wherever chi square 
is computed, it is important that the sums of expected and obtained fre- 
quencies coincide. This check should always be made. 
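
The expected frequencies and the marginal check can be sketched as follows. The function name is hypothetical, and the rule used is simply that each expected frequency is the product of its row and column sums divided by N.

```python
def expected_frequencies(table):
    """Expected cell frequencies under independence, from the marginal totals."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]

# Rows: married, unmarried. Columns: normal, feebleminded (Table 11.1).
observed = [[111, 84], [95, 122]]
expected = expected_frequencies(observed)
print(expected)  # [[97.5, 97.5], [108.5, 108.5]]

# The check recommended above: row sums of expected and obtained frequencies agree.
print([sum(row) for row in expected] == [sum(row) for row in observed])  # True
```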
TABLE 11.3. DISCREPANCIES BETWEEN OBTAINED AND EXPECTED FREQUENCIES IN 
TABLES 11.1 AND 11.2 

Marital status      Normal    Feebleminded
Married              +13.5       -13.5
Unmarried            -13.5       +13.5

TABLE 11.4. THE CELL-SQUARE CONTINGENCIES FOR THE COMPUTATION OF CHI 
SQUARE RELATIVE TO THE STUDY OF MARITAL STATUS AND INTELLIGENCE 

Marital status      Normal    Feebleminded
Married               1.87        1.87
Unmarried             1.68        1.68

Computing Expected Cell Frequencies. In a contingency table of any 
number of rows and columns, the principles of computing the expected 
cell frequencies can be illustrated by the limited 3 × 3 table shown in Table 
11.5. Let the f’s with double subscripts stand for the obtained frequencies. 
The sums of the rows are symbolized by $\Sigma f_a$, $\Sigma f_b$, $\Sigma f_c$, etc., and the sums of 
columns by $\Sigma f_1$, $\Sigma f_2$, $\Sigma f_3$, etc. The expected frequency for any cell in row r 
and column k can be found by the formula 

$$f_e = \frac{(\Sigma f_r)(\Sigma f_k)}{N} \qquad \text{(Expected frequency for a cell in row $r$ and column $k$)} \qquad (11.1)$$

TABLE 11.5. SCHEMA AND SYMBOLS FOR COMPUTATION OF EXPECTED CELL 
FREQUENCIES IN A CONTINGENCY TABLE 

                            Columns
Rows                    1        2        3       Sums of rows
A                     f_a1     f_a2     f_a3         Σf_a
B                     f_b1     f_b2     f_b3         Σf_b
C                     f_c1     f_c2     f_c3         Σf_c
Sums of columns       Σf_1     Σf_2     Σf_3          N

Let $\Sigma f_r$ stand for the sum of any row, for example, $\Sigma f_a$, $\Sigma f_b$, . . . , etc. 
Let $\Sigma f_k$ stand for the sum of any column, for example, $\Sigma f_1$, $\Sigma f_2$, . . . , etc. 

Thus, the expected frequency corresponding to $f_{c3}$ would be derived from 
the product $(\Sigma f_c)(\Sigma f_3)$ divided by N. Hence, the expected frequency for 
row 1 and column 2 of Table 11.2 would be equal to 

$$\frac{(195)(206)}{412} = \frac{40{,}170}{412} = 97.5$$

Computing Cell Discrepancies. Having the expected frequencies $f_e$, we 
now ask whether the observed frequencies $f_o$ deviate from them sufficiently 
to cause us to reject the hypothesis of no difference. For each of the four 
cells of the table, we determine the discrepancy $f_o - f_e$. These discrepancies 
are listed in Table 11.3. It will be seen that except for algebraic sign they 
are all numerically the same. This outcome will be true of all fourfold 
tables of frequencies of this sort, whether the two groups compared have the 
same total numbers of cases or not. This fact can be used to give us short 
cuts in computation, as we shall see later. 

The Cell-square Contingencies. In the solution of chi square, we square 
each discrepancy, divide by the corresponding $f_e$, and sum all the ratios. 
The sum is chi square. In terms of a formula, 

$$\chi^2 = \Sigma \left[ \frac{(f_o - f_e)^2}{f_e} \right] \qquad \text{(General formula for chi square)} \qquad (11.2)$$

where the symbols have been explained above. Each cell provides a ratio 
of $(f_o - f_e)^2$ to $f_e$, which ratio has been called the cell-square contingency. 
This is merely a convenient name, at present, but later (Chap. 14) it will be 
related to prediction procedures. For now, it can be said that chi square 
is the sum of the cell-square contingencies in a contingency table (see Table 
11.4). 

The square of the discrepancy 13.5 is 182.25. In two cells, this is to be 
divided by 97.5, which yields 1.87. In the other two cells it is to be divided 
by 108.5, which yields 1.68. Summing twice 1.87 and twice 1.68, we have 
7.10 as the value of $\chi^2$. 
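
The general formula lends itself to a direct computational sketch. The function name is hypothetical; the obtained and expected frequencies are those of Tables 11.1 and 11.2.

```python
def chi_square(observed, expected):
    """Sum of (f_o - f_e)^2 / f_e over all cells, as in formula (11.2)."""
    return sum(
        (fo - fe) ** 2 / fe
        for row_o, row_e in zip(observed, expected)
        for fo, fe in zip(row_o, row_e)
    )

observed = [[111, 84], [95, 122]]
expected = [[97.5, 97.5], [108.5, 108.5]]
print(round(chi_square(observed, expected), 2))  # 7.1
```

Because the function loops over whatever rows and columns it is given, the same sketch would serve for tables larger than 2 × 2.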

Interpretation of a Chi Square. The number 7.10 stands for the total 
amount of discrepancy between hypothesis and observation. Chi square 
can be small enough to allow us to accept the null hypothesis or to retain 
it with some doubt, or it can be large enough to lead us to reject the hypothesis 
with moderate or with positive assurance. Like z or Student's t ratio, it can 
be interpreted as being significantly or very significantly large, i.e., of being 
so large that sampling alone could account for the results only once in 20 
times, or once in 100 times, as the case may be. 

Degrees of Freedom. Tables of chi square (see Table E, Appendix B) 
enable us to decide the matter. But we must know the number of degrees 
of freedom, df, before we can use the table. In a fourfold table such as we 
have here, there is only 1 degree of freedom. 

Let us see how it is that we have only 1 degree of freedom. Remember 
that we have taken the row and column sums to be fixed. This injects 
considerable restraint into a contingency table. The general rule applying to 
most contingency tables is that the degrees of freedom equal the product 
of the number of rows minus one and the number of columns minus one. 
If there are r rows and k columns, both r and k being greater than 1, 

$$df = (r - 1)(k - 1) \qquad \text{(Number of degrees of freedom in a contingency table of $r$ rows and $k$ columns)} \qquad (11.3)$$


In a 2 X 2 table, applying the formula, we would expect 1 degree of 
freedom. This is made reasonable by the following logic. Once we have 
chosen a single cell frequency, with the row and column sums being what 
they are, all the other cell frequencies are determined; they are not free to 
vary. This is reflected, also, by the fact that there is only one value for the 
cell discrepancies. 

The Sampling Distribution of Chi Square. The importance of degrees 
of freedom can be seen in connection with Fig. 11.1, which shows the sampling 
distributions of chi square for a number of different degrees of freedom 
ranging from 1 to 10. It is because of these known distributions that the 
tables for interpreting a chi square could be constructed. In general, 
distributions of this statistic are positively skewed, and the smaller the number 
of degrees of freedom, the greater the skewness. As the number of degrees of 
freedom becomes large, this distribution approaches the normal curve in form. 
The distribution with 10 degrees of freedom is only slightly skewed. 

FIG. 11.1. Sampling distribution of chi square for various degrees of freedom. (After 
D. Lewis. Quantitative Methods in Psychology. Iowa City: Published by the author, 1948.) 
Use of the Chi-square Tables. Is our chi square of 7.10 significant? Table 
E shows that when df = 1 the largest chi square given is 6.635. Right above 
this is the probability of .01, which means that a chi square as large as 6.635 
or larger could occur by chance alone only once in 100 times. Our chi square 
of 7.10 is larger than 6.635 and therefore could occur in the same manner 
less than once in 100 times. We therefore regard it as very significant and 
reject the hypothesis of no difference between the two groups. 
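
For 1 degree of freedom the table lookup can be checked without a chi-square table. The sketch below is illustrative and relies on one known fact, that with 1 df chi square is the square of a standard normal deviate, so its upper-tail probability can be written with the complementary error function. The function name is hypothetical.

```python
import math

def chi_square_p_1df(chi2):
    """Upper-tail probability of chi square with 1 degree of freedom."""
    # P(chi square >= x) = P(|z| >= sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))

for value in (3.841, 6.635, 7.10):
    print(value, round(chi_square_p_1df(value), 4))
```

The two tabled limits come out very near .05 and .01, and the obtained value of 7.10 gives a probability below .01.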

Relation of Chi Square to t. When there is 1 degree of freedom in a 
contingency table, chi square is equal to $t^2$, or t is equal to chi, the square root 
of chi square. The square root of our chi square obtained for the marital 
data, namely 7.10, is equal to 2.66. This checks exactly with the ratio that was 
reported in an earlier paragraph. A t test and a chi-square test of the same 
statistics will therefore lead to the same inferences when there is 1 degree of 
freedom. 
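
The agreement can be verified from the marital-status frequencies themselves. This sketch assumes the equal-groups formula for a difference between uncorrelated proportions, with the pooled proportion taken from the two groups combined; the variable names are hypothetical.

```python
import math

n = 206                                   # size of either group
married_normal, married_feebleminded = 111, 84
p1, p2 = married_normal / n, married_feebleminded / n
p_e = (married_normal + married_feebleminded) / (2 * n)   # pooled proportion

ratio = (p1 - p2) / math.sqrt(2 * p_e * (1 - p_e) / n)
print(round(ratio, 2), round(ratio ** 2, 2))  # 2.66 7.1
```

The squared ratio reproduces the chi square of 7.10 obtained from the contingency table.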

Chi Square When Frequencies Are Small. When we apply chi square 
to a problem with 1 df and when any cell frequency is less than 10, we should 
apply a modification known as Yates's correction for continuity. This 
correction consists of reducing by .5 each obtained frequency that is greater 
than expected and of increasing by .5 each obtained frequency that is less than 
expected. This has the effect of reducing the amount of each discrepancy between 
obtained and expected frequency to the extent of .5. The result is reduction 
of the size of chi square. 

The correction is needed because of the fact that chi square varies in 
discrete jumps whereas computation by formula gives more continuous 
variations. When frequencies are large this is relatively unimportant, 
but when they are small a change of .5 is of some consequence. The cor- 
rection is particularly important when chi square turns out to be near a 
point of division between critical regions. 

An Example of Yates’s Correction. In a public-opinion poll conducted 
some years ago, sentiment was sampled concerning attitudes toward radio 
newscasts.¹ Some 43 interviewees in one sample were asked the question, 
“Do you find it easier to listen to news than to read it?” The sample had 
been stratified into higher and lower socioeconomic status, 19 being in the 
former and 24 in the latter. The number responding “Yes” to the question 
in the two groups was 10 and 20, respectively. The problem to be investi- 
gated is whether there is a real difference between the two groups in their 
opinions on the question. 

The data have been arranged in the usual manner in Table 11.6. Two 
of the expected frequencies seen there are less than 10. Let us carry through 
the computations first without Yates's correction and then with it to see 
what difference it may make in our conclusions. 

TABLE 11.6. COMPUTATION OF CHI SQUARE FOR RESPONSES OF TWO SOCIOECONOMIC 
GROUPS TO PREFERENCE FOR RADIO NEWS TO READING A NEWSPAPER 

                 Obtained frequencies           Expected frequencies
                 Socioeconomic group            Socioeconomic group
Response        Higher   Lower   Both          Higher   Lower   Both
Yes               10       20     30           13.26    16.74    30
No                 9        4     13            5.74     7.26    13
Both              19       24     43           19       24       43

1 From Cantril, H. The role of the radio commentator. Publ. Opin. Quart., 1939, 8, 
654-662. 

Without the correction, the cell deviations would all equal 3.26. This 
value squared is 10.63. Applying formula (11.2) and solving, we find that 
chi square equals 4.76, which is significant between the .05 and .01 levels. 
With the correction, the cell deviation in all cells is 2.76 (rather than 3.26), 
whose square is 7.62. With this solution, chi square becomes 3.43, which 
fails to reach the .05 level of significance. One would have more confidence 
in the interpretation of the second outcome than the first. Not always will 
the correction make a difference of this kind in the conclusion. In any case, 
the correction should be used in a problem like this. 
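
The two solutions can be compared in a short sketch. The function name and its `yates` flag are hypothetical. Expected frequencies are carried here without rounding, so the results (about 4.74 and 3.40) differ slightly from the hand-computed 4.76 and 3.43, but the conclusions are the same.

```python
def chi_square_2x2(table, yates=False):
    """Chi square for a 2 x 2 table, optionally with Yates's correction."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(table):
        for j, fo in enumerate(row):
            fe = row_sums[i] * col_sums[j] / n
            deviation = abs(fo - fe)
            if yates:
                # Reduce each discrepancy by .5, but never below zero.
                deviation = max(0.0, deviation - 0.5)
            total += deviation ** 2 / fe
    return total

# Rows: "Yes", "No". Columns: higher, lower socioeconomic group.
poll = [[10, 20], [9, 4]]
uncorrected = chi_square_2x2(poll)
corrected = chi_square_2x2(poll, yates=True)
print(round(uncorrected, 2), round(corrected, 2))
```

Only the uncorrected value exceeds 3.841, the chi square required at the .05 level with 1 df.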

It should be noted that the correction of .5 is applied to all cells in the 
table even though only one or two frequencies are small. It should also be 
noted that it is low expected frequencies that determine whether the test 
shall be applied, not low observed frequencies. It is also applied only to 
instances of 1 df, including 2 × 2 and 1 × 2 tables. In larger tables the 
need for correction is not so great and it would be complicated to apply. 
There is also the possibility of combining categories in such a way as to get 
rid of small expected frequencies. Examples of this will be seen later. 

Testing Significance by Direct Computation of Probability. There are 
lower limits to utilizable frequencies, when even Yates's correction is inade- 
quate. If any expected frequency is less than 2, we should not apply the 
computing formulas for chi square, even with the correction. If there is 
1 df and there is a frequency less than 2, it is still possible to answer the 
question, “Given the marginal frequencies, what is the probability that 
distributions among the four cells could be as extreme as this one or one 
more extreme?” The probability, and hence the level of significance, can 
actually be computed without computing chi square.¹ 

For the special case of a fourfold table in which two equal groups of observa- 
tions are being compared, Table N in Appendix B will serve to answer the 
question of statistical significance. It was designed for the following very 
common type of problem. Let us say that an experimental group of 30 
individuals is administered a dose of dramamine sulfate and a control group of 30 
individuals is administered a placebo before a rough flight in an airplane. Of 
the experimental group 5 become airsick and 25 do not; of the control group 
18 become airsick and 12 do not. 

1 For a method for determining exact probabilities of distributions in contingency tables, 
see Walker, H. M., and Lev, J. Statistical Inference. New York: Holt, 1953. P. 104. 

In Table N, each row pertains to groups of a certain size, N_i. In the illustrative 
problem, N_i = 30. A column is provided with frequencies from 0 to 
N_i/2. In using Table N, locate the row that applies, in this case the row for 
N_i = 30. Next, find the column headed with the number that corresponds 
to the smallest frequency in the fourfold table. In this problem that frequency 
is 5, the number in the experimental group who became airsick. 
Given these two values, 30 and 5, we ask the question, “How many cases are 
needed in the other group parallel to the smallest cell frequency to achieve 
chi squares significant at the .05 and .01 levels?” Parallel to the frequency 
of 5 is the frequency of 18 airsick cases in the control group. Table N tells 
us that it would take 13 airsick cases in this group to indicate significance at 
the .05 level and 16 cases to indicate significance at the .01 level. Our 
obtained frequency of 18 exceeds both those values and is therefore a strong 
basis for concluding that we have significance beyond the .01 point. 

Table N has solutions based upon exact probabilities up to an N_i of 20 and 
solutions by formula with Yates's correction for N_i's greater than 20. 
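
The direct computation of probability described above can be sketched as follows. The text does not give the formula (it refers the reader to Walker and Lev), so this sketch assumes the usual hypergeometric computation with all marginal totals held fixed, commonly called Fisher's exact test. The function names are illustrative only, and the airsickness frequencies are used merely to show the mechanics.

```python
from math import comb

def hypergeom_pmf(x, row1, col1, n):
    """P(cell a == x) when row total, column total and grand total are fixed."""
    return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

def one_tail_exact(a, b, c, d):
    """Probability of a fourfold table as extreme as (a, b / c, d), or more
    extreme in the direction of a smaller cell a, given the marginal totals."""
    row1, col1, n = a + b, a + c, a + b + c + d
    lowest = max(0, row1 + col1 - n)  # smallest value cell a can take
    return sum(hypergeom_pmf(x, row1, col1, n) for x in range(lowest, a + 1))

# Experimental group: 5 airsick, 25 not; control group: 18 airsick, 12 not.
p_airsick = one_tail_exact(5, 25, 18, 12)
print(f"one-tail exact probability = {p_airsick:.5f}")
```

The probability printed is for a one-tail question; whether it should be doubled depends on the alternative hypothesis adopted.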

Other Ways of Computing Chi Square in a 2 X 2 Table. In a fourfold- 
table problem, since the discrepancy is the same for all cells, the formula for 
chi square can be written 


$$\chi^2 = (f_o - f_e)^2 \sum \frac{1}{f_e}$$
(Chi square in a 2 X 2 contingency table) (11.4) 

That is, chi square equals the common discrepancy squared times the sum of 
the reciprocals of the four f_e's. As applied to the marital-status problem, 

$$\chi^2 = (13.5)^2 \left( \frac{1}{97.5} + \frac{1}{97.5} + \frac{1}{108.5} + \frac{1}{108.5} \right)$$
$$= 182.25\,(.01026 + .01026 + .00922 + .00922)$$
$$= 7.10$$


If the data are arranged in a 2 X 2 table as shown in Table 11.7, another 
convenient formula for the computation of chi square is 


$$\chi^2 = \frac{N(ad - bc)^2}{(a + b)(a + c)(b + d)(c + d)}$$
(Alternative formula for chi square in a 2 X 2 table) (11.5) 


Applied to the opinion-poll data, 


$$\chi^2 = \frac{43[(10)(4) - (20)(9)]^2}{(30)(19)(24)(13)}$$
$$= 4.74$$

The answer is within rounding error of that computed earlier by formula 
(11.2). 


The last solution was done without Yates's correction. The same formula 
with Yates's correction incorporated reads 
$$\chi^2 = \frac{N\left(|ad - bc| - \frac{N}{2}\right)^2}{(a + b)(a + c)(b + d)(c + d)}$$
[Same as (11.5) with Yates's correction] (11.6) 


Note that the difference ad — bc is taken as positive, as indicated by the 
vertical lines enclosing it. 
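
As a check on formulas (11.5) and (11.6), the two computations can be sketched in a few lines. The function name and the `yates` switch are illustrative; clamping the corrected difference at zero is a common safeguard assumed here, not something the text states.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi square for a fourfold table laid out as in Table 11.7."""
    n = a + b + c + d
    diff = abs(a * d - b * c)            # |ad - bc|, taken as positive
    if yates:
        diff = max(diff - n / 2, 0.0)    # Yates's correction: deduct N/2
    return n * diff ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Opinion-poll data of Table 11.7: a = 10, b = 20, c = 9, d = 4.
chi2_plain = chi_square_2x2(10, 20, 9, 4)
chi2_yates = chi_square_2x2(10, 20, 9, 4, yates=True)
print(f"(11.5) uncorrected: {chi2_plain:.2f}")
print(f"(11.6) with Yates's correction: {chi2_yates:.2f}")
```

The uncorrected value agrees with the 4.74 obtained above; the corrected value is necessarily smaller.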


TABLE 11.7. SYMBOLIC ARRANGEMENT OF DATA IN A 2 X 2 CONTINGENCY TABLE 
ILLUSTRATED BY THE PUBLIC-OPINION DATA 

                   Variable II               Socioeconomic Group 
Variable I    Higher   Lower   Both        Higher   Lower   Both 

Higher          a        b     a + b         10       20     30 
Lower           c        d     c + d          9        4     13 

Both          a + c    b + d     N           19       24     43 


Chi Square in Other Than 2 X 2 Tables. The use of chi square is by no 
means limited to fourfold contingency tables. It can be applied with as few 
as two cells and with a much larger number. First, an example with only two 
frequencies to be tested. 

In a Two-cell Table. For this purpose let us use the polling data on preference 
for the radio. Combining the two socioeconomic groups, we may be 
interested in knowing whether the population they represent is actually in 
favor of radio newscasts. The sample is so small that there may be some 
doubt. The frequencies are 30 in favor and 13 not. Could these frequencies 
have arisen from a population in which the opinion is really evenly divided? 
The null hypothesis for this purpose is a 50-50 division. This is an arbitrarily 
chosen hypothesis; we could have chosen some other, such as a 60-40 division 
of opinion. 

With the 50-50 hypothesis chosen, the expected frequencies are 21.5 and 
21.5, these being one-half of 43. The cell deviations or discrepancies (f_o − f_e) 
are 8.5, one positive and the other negative. The squared discrepancy is 
72.25. Dividing this by f_e, which is the same in both cells, we get a squared 
contingency of 3.360 for each cell. For the two combined we get 6.720, or a 
chi square of 6.72. This is significant beyond the .01 point. 

With a two-cell table, when expected frequencies are equal, as in the last 
illustration, the formula for chi square reduces to the simple form 


$$\chi^2 = \frac{2(f_o - f_e)^2}{f_e}$$
(Chi square in a two-cell table when expected frequencies are equal) (11.7) 

Since, with 1 degree of freedom, z = χ, another formula for z, derived from 
(11.7) and applying in the same special but not uncommon situation, is¹ 

$$z = \frac{f_1 - f_2}{\sqrt{f_1 + f_2}}$$
(z test of departure of two frequencies from equality) (11.8) 

where f_1 = the larger of two frequencies and f_1 + f_2 = N. 
Applied to the polling problem, 

$$z = \frac{30 - 13}{\sqrt{43}} = 2.59$$

The square of this value is 6.71, which checks with the chi square obtained 
above, without correction for continuity. Correction for continuity would 
involve the use of the expression (f_1 − f_2 − 1) in the numerator of formula 
(11.8) in place of (f_1 − f_2). 
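
The agreement between formulas (11.7) and (11.8) can be verified numerically. This is a minimal sketch; the function names and the `corrected` switch are illustrative and not part of the text.

```python
from math import sqrt

def chi_square_two_cell(f1, f2):
    """Formula (11.7): two cells with equal expected frequencies."""
    fe = (f1 + f2) / 2
    return 2 * (f1 - fe) ** 2 / fe

def z_two_cell(f1, f2, corrected=False):
    """Formula (11.8); f1 is the larger frequency. With corrected=True,
    1 is deducted from the difference as a correction for continuity."""
    diff = f1 - f2 - 1 if corrected else f1 - f2
    return diff / sqrt(f1 + f2)

# Polling data: 30 in favor, 13 not.
chi2_poll = chi_square_two_cell(30, 13)
z_poll = z_two_cell(30, 13)
z_poll_corrected = z_two_cell(30, 13, corrected=True)
print(f"chi square = {chi2_poll:.2f}, z = {z_poll:.2f}, z squared = {z_poll ** 2:.2f}")
```

Carried at full precision, z squared equals the chi square exactly; the 6.71 in the text comes from squaring z after rounding it to 2.59.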

1 By a little algebra, it will be found that (f_1 − f_2) = 2(f_o − f_e) and that 
f_e = (f_1 + f_2)/2 = N/2. Equation (11.7) then becomes 

$$\chi^2 = \frac{(f_1 - f_2)^2}{f_1 + f_2}$$

Chi Square in Larger Tables of Frequencies. To illustrate the application 
of chi square to a larger table, this time with a table of six cells, let us consider 
some more survey-of-opinion data. This time the question was whether the 
radio listener agreed with the opinions expressed by a certain radio commentator, 
and the responses were tabulated as “Agree,” “Disagree,” or “Doubtful.” 
The survey was made in two cities and we have the numbers responding in 
each way in both of them. The results are listed in Table 11.8. 

TABLE 11.8. A CHI-SQUARE SOLUTION IN A 2 X 3 TABLE OF DATA ON OPINIONS 
EXPRESSING AGREEMENT OR DISAGREEMENT WITH A CERTAIN RADIO COMMENTATOR 

Observed frequencies (f_o) 

Categories of response    Syracuse    Columbus    Both 
Agree                        73          22        95 
Disagree                      9           4        13 
Doubtful                     41          27        68 

Totals                      123          53       176 

The remaining panels of the table give, cell by cell, the expected 
frequencies (f_e), the discrepancies (f_o − f_e), and the discrepancies 
squared, (f_o − f_e)². 

The derivation of the expected frequencies was carried out with the applica- 
tion of formula (11.1). From here on, the work as recorded in Table 11.8 is 
just as we have done previously. The sum of the square contingencies is 5.13. 
The degrees of freedom (according to formula 11.3) are 2 X 1 = 2. For 2 
degrees of freedom the tables of chi square show that it requires a chi square 
of 5.991 to be significant at the .05 level. Our chi square falls below this 
level, and so there is no really convincing reason to doubt that the two popula- 
tions sampled are alike on the question at issue, though there are less than 10 
chances in 100 that a chi square as large or larger could have arisen by chance. 
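
The computation summarized in this paragraph can be sketched for any table of rows and columns. The helper name is illustrative. The Syracuse column and the "Both" column are taken from Table 11.8; the Columbus column (22, 4, 27) is an assumption obtained by subtracting one from the other.

```python
def chi_square_table(observed):
    """Chi square and df for an r x c table of observed frequencies.
    Expected frequencies are (row total)(column total)/N, as in formula (11.1)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n
            chi2 += (fo - fe) ** 2 / fe
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Rows: Agree, Disagree, Doubtful; columns: Syracuse, Columbus.
opinions = [[73, 22], [9, 4], [41, 27]]
chi2_opinions, df_opinions = chi_square_table(opinions)
print(f"chi square = {chi2_opinions:.2f} with {df_opinions} df")
```

With expected frequencies carried at full precision the result is about 5.16 rather than the 5.13 of the text, which rests on expected frequencies rounded to one decimal; either value falls short of 5.991.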

The two small expected frequencies in Table 11.8 should raise some ques- 
tion concerning the need for action. 

Combining Columns or Rows. No expected frequency is less than 2, but 
if we should decide that it is too risky to solve the problem with so small an 
f_e, there is one thing we could do. Incidentally, it happens in this particular 
problem that the squared discrepancy (f_o − f_e)² was practically zero for the 
cell in which f_e was smallest, so that this cell makes no contribution to chi 
square. It is a situation in which a very small f_e is combined with a relatively 
large squared discrepancy that is serious, for then the cell’s contribution to 
chi square is unduly large and yet of doubtful stability or meaningfulness. 

If we had combined the “Disagrees” with the “Doubtfuls” in this prob- 
lem, we should have had observed frequencies of 50 for Syracuse and 31 for 
Columbus, with expected frequencies of 56.6 and 24.4, respectively. We can 
combine both observed and expected frequencies after the latter have been 
computed in uncombined form. After this kind of a combination is made, 
the size of chi square is likely to be smaller than before, though not always. 
Even though it is smaller, the number of degrees of freedom is also reduced 
and the significance limits are accordingly smaller, so that the chances of a 
significant departure of data from the null distribution are presumably about 
the same as they were. 
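
The combining of categories can be sketched as below. Expected frequencies are computed first in uncombined form and then added, as the text describes. The helper names are illustrative, and the Columbus frequencies (22, 4, 27) are again an assumption obtained by subtracting the Syracuse column from the totals of Table 11.8.

```python
def expected_frequencies(observed):
    """Expected cell frequencies from the marginal totals of an r x c table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def combine_rows(table, i, j):
    """Copy of `table` with row j added into row i and row j dropped."""
    merged = [x + y for x, y in zip(table[i], table[j])]
    return [merged if idx == i else row
            for idx, row in enumerate(table) if idx != j]

observed = [[73, 22], [9, 4], [41, 27]]       # Agree, Disagree, Doubtful
expected = expected_frequencies(observed)

obs_combined = combine_rows(observed, 1, 2)   # Disagree + Doubtful
exp_combined = combine_rows(expected, 1, 2)   # combined after computing f_e

chi2_combined = sum((fo - fe) ** 2 / fe
                    for o_row, e_row in zip(obs_combined, exp_combined)
                    for fo, fe in zip(o_row, e_row))
print(obs_combined[1], [round(fe, 1) for fe in exp_combined[1]])
print(f"chi square after combining = {chi2_combined:.2f} with 1 df")
```

The combined row reproduces the frequencies quoted above (50 and 31 observed, 56.6 and 24.4 expected), and the chi square of the reduced table is to be referred to 1 df rather than 2.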


SOME SPECIAL APPLICATIONS OF CHI SQUARE 


Chi Square When Proportions Are Correlated. Many of the applications 
of chi square involve the comparison of two proportions or percentages, as we 
have seen. In all the examples thus far the two proportions are uncorrelated, 
for they were derived from different observations or individuals. The chi-square 
formulas given thus far assume such experimental independence. We 
shall now consider some applications of chi square when proportions are 
correlated. 

Cantril, op. cit. 

Test for Two Correlated Proportions. For a difference between two corre- 
lated proportions we saw in Chap. 10 a z test. Since with 1 degree of freedom 
χ² is equal to z², we might expect a very direct estimate of chi square by 
squaring both sides of formula (10.10). This expectation is correct, and the 
formula is 


$$\chi^2 = \frac{(b - c)^2}{b + c}$$
(Chi square for a difference between two correlated proportions) (11.9) 


where the symbols are as defined in Table 10.3. 

It should be noted here, as in Table 10.3, that b and c indicate the numbers 
of cases that change categories between a first and second application of the 
experiment. Either the same individuals or matched individuals must be 
involved so that the numbers of changes may be counted. The illustrative 
problem in Chap. 10 involved 100 students who had attempted to answer two 
items. If there is correlation between the items there is also correlation 
between the proportions. The number of changing individuals denoted by b 
(answering the first correctly but not the second) was 5. The number of 
changing individuals denoted by c (answering the second correctly but not 
the first) was 15. Applying formula (11.9), 

$$\chi^2 = \frac{(5 - 15)^2}{5 + 15} = \frac{100}{20} = 5.00$$


which is significant between the .05 and .01 points. 

In small samples, Yates’s correction should be incorporated in formula 
(11.9). This involves deducting 1 from the difference, where the difference 
is regarded as positive, before squaring. 
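
Formula (11.9), with and without the correction just described, can be sketched as follows. The function name and the `yates` switch are illustrative.

```python
def chi_square_correlated(b, c, yates=False):
    """Formula (11.9): chi square for two correlated proportions, where b and c
    are the numbers of cases changing category in each direction."""
    diff = abs(b - c)              # difference regarded as positive
    if yates:
        diff = max(diff - 1, 0)    # deduct 1 before squaring
    return diff ** 2 / (b + c)

# Item data from Chap. 10: b = 5, c = 15.
chi2_items = chi_square_correlated(5, 15)
chi2_items_yates = chi_square_correlated(5, 15, yates=True)
print(chi2_items, chi2_items_yates)
```

Only the changing cases enter the computation; the individuals who responded alike on both occasions do not affect the statistic.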

Test for More Than Two Correlated Proportions. A chi-square test for 
differences among more than two correlated proportions is described by 
McNemar.¹ 

Chi-square Test of the Hypothesis of Normal Distribution. One con- 
venient use of chi square is in testing whether or not a set of observed 
frequencies in a frequency distribution could probably have arisen from a nor- 
mally distributed population. The procedure is like that in previous exam- 
ples, except that the expected frequencies are estimated in a different manner. 
Following the procedure illustrated in Chap. 7, Table 7.1, the mean and 
standard deviation of the obtained data are assumed to be the mean and 
standard deviation of a normal curve that comes closest to the data. The 
discrepancies between expected and observed frequencies are found and 
squared. The squared differences are divided by their corresponding 
expected frequencies to find the usual ratios, which are summed to find chi 
square. The number of df to use is the number of class intervals or categories 
minus 3. One degree of freedom has been lost in computing the mean; a 
second in computing the standard deviation; and a third for N, the size of 
sample. These three statistics place restrictions upon the freedom of the 
observed frequencies to vary from the expected ones. 

1 McNemar, Q. Psychological Statistics. New York: Wiley, 1954. P. 232. 


TABLE 11.9. A CHI-SQUARE TEST OF THE NORMAL-DISTRIBUTION HYPOTHESIS 
APPLIED TO A FREQUENCY DISTRIBUTION OF SCORES 

  (1)         (2)             (3)              (4)             (5)              (6) 
            Original       Regrouped          Cell         Squared cell     Cell-square 
            grouping      frequencies     discrepancies   discrepancies    contingencies 
Scores     f_o    f_e     f_o    f_e       (f_o − f_e)    (f_o − f_e)²    (f_o − f_e)²/f_e 

44-46       0     0.2 
41-43       1     0.8      5     3.2          +1.8            3.24            1.012 
38-40       4     2.2 
35-37       5     5.0      5     5.0           0.0            0.00            0.000 
32-34       8     9.0      8     9.0          −1.0            1.00            0.111 
29-31      14    13.3     14    13.3          +0.7            0.49            0.037 
26-28      17    15.8     17    15.8          +1.2            1.44            0.091 
23-25       9    15.1      9    15.1          −6.1           37.21            2.464 
20-22      13    11.7     13    11.7          +1.3            1.69            0.144 
17-19       8     7.2      8     7.2          +0.8            0.64            0.089 
14-16       3     3.6      3     3.6          −0.6            0.36            0.100 
11-13       4     1.5      4     2.0          +2.0            4.00            2.000 
 8-10       0     0.5 

Sums       86    85.9     86    85.9          +0.1                         χ² = 6.048 


At the tails of the distribution, where f_e's tend to be very small, we allow 
none to be less than two. We do this by combining intervals. As we com- 
bine intervals we lose degrees of freedom. Note that it was stated above 
that the number of df is the number of categories minus 3, unless there has 
been no combining, in which case we can say that df equals the number of 
intervals minus 3. Another thing to be concerned about is that the sum of 
the expected frequencies equals N or approaches it very closely. The sum of 
the discrepancies should equal zero. 

Using the data from Table 7.1, with the expected and obtained frequencies 
already given, we will make the chi-square test. First, to get rid of some 
very small tail frequencies we combine three classes at the upper end of the 
distribution and two at the lower end. All the expected frequencies are now 
2 or greater. The results of this are shown in Table 11.9. The next steps 
are carried out, as shown, with a resulting chi square equal to 6.05. With 
7 df, a chi square of 14.07 is required for significance at the .05 point. We 
definitely should not reject the hypothesis of normality of distribution. 
From the chi-square table we find by rough interpolation that about 60 per 
cent of the chi squares from similar samples from the same population could 
be as large as 6.05 or larger. We may accept the idea that the population 
from which the sample came is normally distributed on the scale of measure- 
ment used. 
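
The test of fit just carried out can be sketched numerically. The regrouped frequencies are those of Table 11.9; the 35-37 class (f_o = 5, f_e = 5.0) and the combined lowest class (f_o = 4, f_e = 2.0) are taken as the values the column totals imply. The tail-probability helper `chi2_sf` is an illustrative addition, not part of the text: it relies on the closed-form sums that hold for whole-number df, so that no chi-square table is needed.

```python
from math import erfc, exp, gamma, sqrt

def chi2_sf(x, df):
    """Upper-tail probability of chi square for integer df."""
    y = x / 2.0
    if df % 2:                    # odd df: start from df = 1
        a, q = 0.5, erfc(sqrt(y))
    else:                         # even df: start from df = 2
        a, q = 1.0, exp(-y)
    while a < df / 2.0:           # each step adds 2 to the df
        q += y ** a * exp(-y) / gamma(a + 1.0)
        a += 1.0
    return q

f_obs = [5, 5, 8, 14, 17, 9, 13, 8, 3, 4]
f_exp = [3.2, 5.0, 9.0, 13.3, 15.8, 15.1, 11.7, 7.2, 3.6, 2.0]

chi2_fit = sum((fo - fe) ** 2 / fe for fo, fe in zip(f_obs, f_exp))
df_fit = len(f_obs) - 3           # categories minus 3 for a normal fit
p_fit = chi2_sf(chi2_fit, df_fit)
print(f"chi square = {chi2_fit:.2f}, df = {df_fit}, P = {p_fit:.2f}")
```

Computed this way the probability comes out a little above one-half, somewhat below the figure reached by rough interpolation in the table, and the decision not to reject the hypothesis of normality is unchanged.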

On very rare occasions the probability of a chi square so far from zero as 
the obtained one is extremely high, perhaps even .95 or higher. In this 
event, we have a value that is near the zero end of the chi-square distribution, 
which is an outcome as rare as a large one significant beyond the .05 point. 
Some investigators suspect such an outcome and look for computing errors or 
some other possible source of bias that might produce this kind of rare event. 
The fit is often regarded, under these conditions, as “too good to be true.” 
If no artificial reason is found for this outcome, there is no need for any par- 
ticular action. 

It might be added that goodness of fit of data to other than normal distri- 
butions can also be tested by chi square, for example, a binomial distribution. 
In general, the procedure parallels that given for the normal curve, but there 
would have to be a decision as to degrees of freedom to fit the logic of the 
particular case. In the case of a binomial distribution, the number of df 
equals the number of categories minus 1, the one restriction being that the 
discrepancies must add up to zero. 

Test for Homogeneity of Variances. We sometimes want to know whether 
the differences among variances from several similar samples indicate that 
they came from populations differing in variance or whether they could have 
arisen from a common population with respect to variance. Whether we 
combine several samples to make a larger one sometimes depends upon the 
answer to this question (along with other questions, such as whether the 
means, also, are homogeneous). Making a test of homogeneity of means 
also rests upon the assumption that variances are homogeneous. 

With respect to sample variances, we can obtain a test of the differences 
among them, leading to a statistic that can be interpreted as chi square. 
There are several ways of doing this. The method to be described is known 
as Bartlett’s test. 

Bartlett's Test of Homogeneity. When the N’s differ among the samples, a 
sampling statistic B’ is given by the formula 

$$B' = 2.3026\left[(\log \bar{s}^2)(N - k) - \sum (n_i - 1)(\log s_i^2)\right]$$
(Sampling statistic for Bartlett's test of homogeneity of variances) (11.10) 

where 2.3026 = constant needed because we use common logarithms instead 
               of Napierian logarithms 
      s̄² = unweighted arithmetic mean of the several variances 
      N = total number of observations in all samples combined 
      n_i = number of observations in any one sample 
      k = number of samples 


TABLE 11.10. APPLICATION OF BARTLETT'S TEST OF HOMOGENEITY OF VARIANCES 
IN FOUR SAMPLES 

        s_i²     n_i − 1    log s_i²    (n_i − 1)(log s_i²)    1/(n_i − 1) 
       194.97      201       2.2900          460.2900            .004975 
       109.44      138       2.0392          281.4096            .007246 
       162.28       65       2.2103          143.6695            .015385 
       100.03      165       2.0000          330.0000            .006061 

Σ      566.72      569       8.5395        1,215.3691            .033667 

s̄² = 141.68          N − k = 569 
log s̄² = 2.1513 


As an example, let us use the variances from Data 4D, in which there were 
four samples involving possible differences between the sexes and also between 
alcoholics and nonalcoholics.¹ Thus, k = 4. The variances are given in 
Table 11.10, with the corresponding df, which is n_i − 1 for each sample. 
The logarithm for s² is given in the third column, and the products of df times 
its corresponding log s² in column 4. The mean of the four variances is 
141.68, whose logarithm is 2.1513. N − k, the df for the combination of 
samples, equals 569. We are now ready to apply formula (11.10). 


$$B' = 2.3026[(2.1513)(569) - 1{,}215.3691]$$
$$= 2.3026(8.7206)$$
$$= 20.0801$$


The statistic B’ has a sampling distribution that approaches that of chi 
square and can be interpreted safely as chi square except when it falls near 
the boundary of a selected region of significance. Since there are k − 1 df 
involved here in the interpretation of B’, the obtained B’ is significant well 
beyond the .01 point. With 3 df, a chi square of 11.341 is required at the .01 
point. 
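
The computation of B’ from Table 11.10 can be sketched as follows. The function name is illustrative. The sketch follows formula (11.10) exactly as the text states it, with the unweighted mean of the variances; the usual statement of Bartlett's test takes instead the mean weighted by df (the pooled variance), which would give a somewhat different value for these unequal samples.

```python
from math import log10

def bartlett_b_prime(variances, dfs):
    """Formula (11.10) as given in the text (unweighted mean variance)."""
    mean_var = sum(variances) / len(variances)
    total_df = sum(dfs)                                   # N - k
    weighted_logs = sum(df * log10(v) for v, df in zip(variances, dfs))
    return 2.3026 * (log10(mean_var) * total_df - weighted_logs)

variances = [194.97, 109.44, 162.28, 100.03]
dfs = [201, 138, 65, 165]

b_prime = bartlett_b_prime(variances, dfs)
print(f"B' = {b_prime:.2f} with {len(variances) - 1} df")
```

Because the logarithms here are carried to more than four decimals, the result differs slightly in the second decimal from the 20.0801 of the text; it remains well beyond the 11.341 required at the .01 point.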

A correction in B’ may sometimes be needed in order to make the interpretation 
more exact. This correction is known as C and the corrected 
statistic as B, where B = B’/C. The formula for computing C is 

$$C = 1 + \frac{1}{3(k - 1)}\left[\sum \frac{1}{n_i - 1} - \frac{1}{N - k}\right]$$
(11.11) 

where the symbols are as defined for formula (11.10). 

1 It is probably best to confine the use of Bartlett’s test to samples differing with respect 
to only one source of variation rather than two or more, as we have in the illustration. 

The new information needed in applying formula (11.11) is the sum of the 
reciprocals of the four sample df's. We see those reciprocals in the last 
column of Table 11.10, also their sum. The solution for C in this problem is 

$$C = 1 + \frac{1}{3(3)}(.033667 - .001757) = 1.003546$$

C is usually just slightly greater than 1.0, which indicates that B’ is a bit too 
large. Applying the correction to the computed B’, we have 

$$B = \frac{B'}{C} = \frac{20.0801}{1.003546} = 20.01$$


The change from B’ to B here is trivial. It would usually have little effect 
upon the major statistical decision. 
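
The correction can be checked in a few lines. The function name is illustrative, and B’ is simply taken as the 20.0801 obtained above.

```python
def bartlett_correction(dfs):
    """Formula (11.11): the correction C computed from the sample df's."""
    k = len(dfs)
    return 1 + (sum(1 / df for df in dfs) - 1 / sum(dfs)) / (3 * (k - 1))

dfs = [201, 138, 65, 165]
c_factor = bartlett_correction(dfs)
b_corrected = 20.0801 / c_factor
print(f"C = {c_factor:.6f}, B = {b_corrected:.2f}")
```

When all samples have the same df, the same function gives 1 + (k + 1)/[3k(n_i − 1)], so it also covers the case of samples of equal size.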

Bartlett's Test for Samples of Equal Size. When the samples are of equal 
size, there is some saving of computation by use of the formulas 

$$B' = 2.3026(n_i - 1)\left(k \log \bar{s}^2 - \sum \log s_i^2\right)$$
(Bartlett’s test when samples are of equal size) (11.12) 

and 

$$C = 1 + \frac{k + 1}{3k(n_i - 1)}$$
[Correction to go with (11.12)] (11.13) 

F Tests Following Bartlett's Test. Having found a significant B statistic, 
we may be interested in knowing between what pairs of samples the signifi- 
cant differences are. In the data of the illustration, we might want to know 
whether there is a significant sex difference or a significant difference between 
alcoholics and nonalcoholics, or both. We may test this by applying the F 
test as described in Chap. 10, taking two variances at a time. It should be 
said, however, that if B proves to be insignificant and we accept the null 
hypothesis for the whole group of samples, we should not then apply an F test 
to any pair. We should distrust any significant F in this case, even if we 
happened to find one. 

Significance of a Combination of Tests. Sometimes we have made a z, t, or 
F test in several similar, independent samples. Perhaps the sampling statistic 
was not significant beyond the adopted probability level in any sample, and 
yet the deviations were all in the same direction from the value indicated by 
the null hypothesis. In other instances, perhaps some of the samples gave 
significant results and some did not. Were the significant ones merely high 
chance deviations? Some method is obviously needed to make a single test 
of all the data. 

If we happened to know that certain sets of the data came by random 
sampling from the same population, and the means and variances within 
those sets prove to be homogeneous, we should be justified in pooling the sets 
and making new tests of significance, with enlarged df and more power. But 
this is probably not the most efficient way, even if we already have the necessary 
information regarding homogeneity. There are ways of considering in 
combination the results of several significance tests already applied to the 
samples individually. A method based upon the binomial distribution will 
be mentioned, and a method of combining probabilities will be described. 

Probability of Repeated Significance in a Binomial Distribution. If we have 
adopted the .05 level of significance for each sample, and we have k samples, 
the expansion of the binomial (p + q)^k, where p = .05 and q = .95, will give 
us the probabilities of different numbers of outcomes at the .05 level. The 
same could be done for outcomes at the .01 level, in which case p = .01 and 
q = .99. We could answer such questions as, “In five samples, what is the 
probability of obtaining two or more t ratios beyond the .01 level?” The 
solution would be similar to that described in Chap. 10 for the use of the 
binomial distribution. 
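
The question posed above can be answered by summing terms of the binomial expansion. This is a minimal sketch; the function name is illustrative.

```python
from math import comb

def prob_at_least(r, k, p):
    """Probability of r or more 'significant' outcomes among k independent
    samples when each has probability p of being significant by chance."""
    return sum(comb(k, x) * p ** x * (1 - p) ** (k - x) for x in range(r, k + 1))

p_two_of_five_01 = prob_at_least(2, 5, 0.01)   # two or more beyond the .01 level
p_two_of_five_05 = prob_at_least(2, 5, 0.05)   # same question at the .05 level
print(f"{p_two_of_five_01:.5f}  {p_two_of_five_05:.4f}")
```

Two or more outcomes beyond the .01 level in five samples would thus occur by chance about once in a thousand sets of five samples.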

Wilkinson has tabled the tail probabilities for numbers of samples from 2 to 
25, for either the .05 or the .01 level.¹ If we have more than 25 samples, we 
might consider using the normal-curve approximation, as described also in 
Chap. 10. However, the p is so small that the limiting case of Np (for justifying 
a normal-distribution approximation) would require an N of 100 samples 
when our significance level is .05, and 500 samples when the level is .01. 
Sakoda, Cohen, and Beall have provided charts to take care of cases of N up 
to 100 for the .05 level and of N up to 500 for the .01 level.² 

A Chi-square Test of Combined Probabilities. A method of combining tests 
of significance that does not require so many computing aids, but does involve 
the use of logarithms, will now be described. It has been demonstrated that 
there is a mathematical way of transforming a probability into a chi square. 
In general, χ² = −2 log_e p, with 2 df. In this method, then, we shall need 
to know the value of the probability attached to each obtained sampling 
statistic. This can be obtained, of course, from tabled distributions of z, t, or 
F, whichever test we are applying. 

Where there are several probabilities involved, we can transform each into 
a chi square, then sum them, and sum their corresponding degrees of freedom. 
Because of the additive property of chi square, the sum is also a chi square 
with combined df. The computing formula is 


$$\chi^2 = -4.605 \sum \log p_i$$
(Chi square for a combination of probabilities) (11.14a) 

$$\chi^2 = -4.605 \log (p_1 p_2 p_3 \cdots p_k)$$
(11.14b) 


where p_i = probability that a deviation of the obtained size could occur by 
chance. The constant −4.605 represents the product of −2 times the constant 
2.3026, which is needed because we use common logarithms rather than 
Napierian logarithms. The sum has 2k degrees of freedom, where k is the 
number of tests made. It will be noted from the two forms of the equation 
that we may obtain the logarithm of each probability first, then sum them 
(11.14a), or we may find the product of all the probabilities and then find the 
one logarithm of the product. The latter solution is simpler when k is small. 

1 Wilkinson, B. A statistical consideration in psychological research. Psychol. Bull., 
1951, 48, 156-158. 

2 Sakoda, J. M., Cohen, B. H., and Beall, G. Tests of significance for a series of statistical 
tests. Psychol. Bull., 1954, 51, 172-175. 

Suppose that we have derived three estimates of correlation between a pair 
of variables in three samples. In each sample we have tested the hypothesis 
of no correlation, yielding a z or a t ratio. The probability for such a z or t 
value would be found by reference to the normal or the Student distribution, 
respectively.¹ The probability associated with a one-tail test is the one to use in 
formula (11.14).² 


TABLE 11.11. CHI SQUARE FOR A COMBINATION OF THREE PROBABILITIES 


Σ log p_i = −3.8416 = −5 + 1.1584 


In Table 11.11 we have, first, three coefficients of correlation from three 
independent samples. Each was based upon a sample in which N = 50. 
The N's need not be equal for applying this chi-square test. The SE of an r 
of zero when N = 50 is .143. Each r deviates from zero by the number of z 
units shown in the second column. One-tail tests give the probabilities in 
column 3. The logarithms of these probabilities are given in column 4. 

As the student who remembers his algebra will recall, the four digits to 
the right of the decimal point are found in the table of common logarithms 
(see Table K). The negative number at the left of each decimal point 
comes from the fact that each probability is a value less than 1.0. The rule 
is to make this number one more than the number of zeros to the right of the 
decimal point in р;. The summing of these logarithms is done for the two 
components separately, after which an algebraic sum of the two component 
sums is found. The sum of the logarithms is a numerical value of −3.8416. 
Multiplying this by —4.605 [from formula (11.14a)], we find a chi square of 
17.69. Reference to the chi-square table with 6 df shows this to be definitely 
significant beyond the .01 level. Thus, a correlation that failed to be sig- 
nificant beyond the .01 point in any of the three samples is found to be in 
that region when the tests are combined. 
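
Formula (11.14a) and the reference to the chi-square distribution can be sketched as follows. The function names are illustrative. The only figure taken from the text is the sum of logarithms, −3.8416; the three probabilities of .1 in the second example are hypothetical and serve only to show how the function is called. The tail-probability helper uses the closed form that holds for even df, which is always the case here since the df are 2k.

```python
from math import exp, log10

def combined_chi_square(probabilities):
    """Formula (11.14a): chi square and df for k independent one-tail probabilities."""
    chi2 = -4.605 * sum(log10(p) for p in probabilities)
    return chi2, 2 * len(probabilities)

def chi2_sf_even(x, df):
    """Upper-tail probability of chi square when df is even."""
    y, term, total = x / 2, 1.0, 0.0
    for j in range(df // 2):
        total += term            # adds (x/2)^j / j!
        term *= y / (j + 1)
    return exp(-y) * total

# From the text: sum of log p_i = -3.8416 for three samples (6 df).
chi2_text = -4.605 * -3.8416
p_text = chi2_sf_even(chi2_text, 6)
print(f"chi square = {chi2_text:.2f}, P = {p_text:.4f}")

# Hypothetical one-tail probabilities, for illustration only.
chi2_demo, df_demo = combined_chi_square([0.1, 0.1, 0.1])
```

The probability attached to the combined chi square falls below .01, in agreement with the conclusion drawn from the table.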

Several restrictions and qualifications with regard to the use of this composite 
test should be noted. The fact that the tests combined should have 
been based upon independent samples was stressed. The fact that the 
probabilities to be utilized should be from one-tail tests was mentioned. If 
in the end a two-tail test is wanted, we must double the probability attached 
to the obtained chi square. 

1 For probabilities from Student’s distribution, see Walker and Lev, op. cit. Table IX. 

2 See Gordon, M. H., Loveland, E. H., and Cureton, E. E. An extended table of chi-square. 
Psychometrika, 1952, 17, 311-316. 

If several parallel tests of samples have been made, the combination of 
these that is tested should not be a selected one, for example, those with 
highest N_i values only. All legitimate single-sample tests that are clearly 
parallel should be included. 

If the deviation from the null hypothesis happens to be in the opposite 
direction for any of the samples, for those samples use q (q = 1 − p, where p 
is the smaller tail area) instead of p, but include such a sample. 


OTHER DISTRIBUTION-FREE STATISTICS 


In recent years, many new statistical processes have been developed to 
take care of the experimental situation in which samples are small and the 
form of the population distribution is not normal. Some of these statistics 
will now be described. 

Before an investigator resorts to their use, however, he should consider 
whether any of the more powerful tests can be used in any way. Except 
for chi square, the distribution-free, or nonparametric, methods generally 
have lower power to detect a real difference as significant. When there is 
any choice, therefore, we should prefer a parametric test, except where a 
quick, rough test will do. We can sometimes create a choice where there 
seems to be none, as will be seen in the following discussion. 

Transformation of Measurement Scales. Sometimes the nonnormal 
form of distribution in a population is due to an inappropriate measuring 
scale or to restrictions that result in distorted scales. For example, dis- 
tributions of simple reaction times are generally skewed positively. This 
may be viewed as partly because of the restriction that no reaction time can 
be less than zero, or because there is some minimal time below which the 
reactor cannot go, with no restriction at the other side of the distribution. 
The effects of the restriction are felt all along the range; the distribution is 
not simply truncated. 

The question posed by such a situation is whether, by some method of 
transformation, we can convert the measurements into values on a new 
scale on which the distribution is normal. A logical justification for such a 
transformation would be our belief that the underlying psychological variable 
or trait is normally distributed, if only we had an appropriate scale. Such 
logical defense would not be essential, however. Tests made of statistics 
on transformed scales lead to conclusions that hold for the natural phenomena 
under investigation. We saw this in connection with the transformation 
of r to Fisher’s z. 

One possibility for transformation of reaction-time measurements would 
be to find the logarithm of each time value. This would condense the larger 
measurements into smaller scale ranges relative to the smaller measurements 
and thus reduce skewing, if not eliminate it. With the measurements 
transformed to log time, we could proceed to apply parametric tests, even 
in small samples. 
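
The effect of a logarithmic transformation upon skewness can be illustrated with a small sketch. The reaction times below are hypothetical values invented for the illustration, and the skewness index used (the third moment about the mean divided by the cube of the standard deviation) is one common choice, assumed here rather than taken from the text.

```python
from math import log10

def skewness(values):
    """Moment coefficient of skewness: m3 / m2^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical simple reaction times in milliseconds, skewed positively.
reaction_ms = [150, 160, 170, 185, 200, 220, 250, 300, 380, 520]
log_times = [log10(t) for t in reaction_ms]

raw_skew = skewness(reaction_ms)
log_skew = skewness(log_times)
print(f"skewness of times = {raw_skew:.2f}, of log times = {log_skew:.2f}")
```

For these figures the transformation reduces the skewing without eliminating it, which is the more usual outcome.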

Other examples of nonlinear transformations could be given. One is the 
conversion of proportions or percentages into corresponding angle values in 
degrees of arc. In sampling, these are normally distributed where extreme 
proportions are not. For an excellent discussion of the subject of transformation, 
see Mueller.¹ 

The Sign Test. One of the simplest tests of significance in the non- 
parametric category is the sign test. Let us say that we have two parallel 
sets of measurements that are paired off in some way. The data in Table 
11.12 are 10 of the successive pairs of knee-jerk measurements presented 
originally in Table 9.5. The hypothesis to be tested is that they arose from 
random sampling from the same population. If this hypothesis were true, 
half the changes from T to R should be positive and half should be negative. 
Another way of stating the null hypothesis is to say that the median change 
is zero. 


TABLE 11.12. APPLICATION OF THE SIGN TEST TO 10 PAIRS OF THE KNEE-JERK 
DATA FROM TABLE 9.5 

 T*     R*     Sign of T − R 
 19     14          + 
 19     19         (0) 
 26     30          − 
 15      7          + 
 18     13          + 
 30     20          + 
 18     17          + 
 30     29          + 
 26     18          + 
 28     21          + 

* T = knee-jerk measurement under tension. 
  R = measurement under relaxation. 


There are 10 pairs of observations; 10 changes are involved. But note 
that one change is zero. Since we cannot include this as either positive or 
negative, it is discarded, leaving nine changes for the test. The hypothesis 
now calls for 4.5 positive differences, whereas we obtained eight. Is this a 
significant deviation? 

1 Mueller, C. G. Numerical transformations in the analysis of experimental data. 
Psychol. Bull., 1949, 46, 198-223. 

The rather obvious test to make is based upon the binomial distribution 
for p = .5 and n = 9. On this basis, eight or more plus signs could occur 
by chance 10 times in 512 trials (1 chance in 512 for exactly nine, and 9 
chances for exactly eight). For a one-tail test this deviation is significant 
with P equal to approximately .02. For a two-tail test we double the 
probability, as usual, which gives a departure significant at the .04 level. 
We would make a two-tail test if our alternative hypothesis were that these 
results did not come from the same population, i.e., with respect to central 
value. We would make a one-tail test if the alternative hypothesis at the 
start were that the T values tend to be higher than the R values. 
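The binomial count described above can be checked with a short sketch. The function name `sign_test_p` and the variable names are illustrative assumptions, not part of the text; the data are the 10 pairs of Table 11.12, and the one-tail probability is taken in the direction of the more frequent sign.

```python
from math import comb

def sign_test_p(n_plus, n_minus):
    """Binomial probabilities for the sign test with p = .5 (hypothetical helper)."""
    n = n_plus + n_minus                      # zero differences are already discarded
    k = max(n_plus, n_minus)                  # number of signs in the majority direction
    tail = sum(comb(n, i) for i in range(k, n + 1))   # outcomes at least this extreme
    one_tail = tail / 2 ** n
    two_tail = min(1.0, 2 * one_tail)         # doubled, as in the text
    return tail, one_tail, two_tail

T = [19, 19, 26, 15, 18, 30, 18, 30, 26, 28]   # tension
R = [14, 19, 30, 7, 13, 20, 17, 29, 18, 21]    # relaxation
diffs = [t - r for t, r in zip(T, R) if t != r]
n_plus = sum(d > 0 for d in diffs)
n_minus = sum(d < 0 for d in diffs)
tail, one_tail, two_tail = sign_test_p(n_plus, n_minus)
print(n_plus, n_minus, tail, round(one_tail, 4), round(two_tail, 4))
# prints: 8 1 10 0.0195 0.0391
```

The count of 10 favorable outcomes in 512 matches the figure given in the text, and the rounded probabilities correspond to the .02 and .04 levels.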

Table O in Appendix B will be useful in applying this sign test, since 
frequencies for the binomial distribution (where p = .5) are given, as well 
as their total, for each value of n up to n = 20. For cases in which the 
number of pairs is greater than 20, the normal-curve approximation may be 
used, as described in Chap. 10. 

The assumptions involved in making the sign test include mutual inde- 
pendence of the differences. The members of pairs may be correlated or not. 
Nothing is assumed concerning the shape of the distribution or concerning 
equality of variances. The differences need not even be measured accurately, 
but the direction of each difference should be experimentally established. 

One weakness of the sign test is that it does not use all the available 
information. If the measurements are on a scale of equal units, on which 
differences may be compared for size as well as for direction, the sign test 
ignores the information provided by size. It is said that, except for very 
small samples, the sign test is only about 60 per cent as powerful as a t test 
would be for the same data, where both apply. This difference in power 
could be compensated for by increasing the size of sample. If we had 
applied the sign test to the entire data in Table 9.5, we should have found 
that 18 out of 25 signs are positive. By the use of the binomial distribution, 
this would indicate a deviation significant near the .02 level (one-tail test), 
which agrees with the result from the smaller sample of 10 pairs. In Chap. 9, 
however, the t test for the same complete data was significant almost to the 
.001 point in a one-tail test. The difference in sensitivity of the two tests 
in this particular illustration seems to be appreciable. 

The Median Test. The median test involves finding a common median 
for the two samples being compared, as a first step. Next, the numbers of 
cases above and below the common median are counted in each sample, 
resulting in a fourfold contingency table, as in Table 11.13. The observa- 
tions are not paired or correlated, and the N may differ in the two groups. 
Equal N's would make the test easier to apply, as will be seen. Then we 
may use help from Table N. 

TABLE 11.13. APPLICATION OF THE MEDIAN TEST TO TWO SAMPLES UNDER 
CONDITIONS A AND B 

Samples 
  A     B 
 14     5 
 13     7 
 10     6 
 12     5 
 15    11 
  9     8 
  9    10 

Mdn = 9.5 

Contingency Table 
                   Samples 
                   A     B    Both 
10 and above       5     2      7 
9 and below        2     5      7 
Both               7     7     14 


The median of the 14 observations in Table 11.13 is 9.5. Values of 10 
and above are easily segregated from those of 9 and below, as shown in the 
fourfold table. We cannot estimate the chi square, or its level of significance, 
for this table. Reference to Table N indicates that chi square is not 
significant, with a P greater than .05 (two-tail test). 
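As a rough check on the Table N decision, the fourfold table can be built from the raw scores and an exact probability computed from the hypergeometric distribution (Fisher's exact method). This is a sketch under the assumption that an exact probability of this kind is an acceptable stand-in for the tabled values; it does not reproduce Table N, and the variable names are illustrative.

```python
from math import comb
from statistics import median

A = [14, 13, 10, 12, 15, 9, 9]
B = [5, 7, 6, 5, 11, 8, 10]

mdn = median(A + B)                       # common median of the pooled samples
a = sum(x > mdn for x in A)               # A above the median
b = sum(x > mdn for x in B)               # B above the median
c, d = len(A) - a, len(B) - b             # A and B below the median

def hypergeom(a_cell):
    """Probability of a_cell A-cases above the median with all margins fixed."""
    above = a + b
    return (comb(len(A), a_cell) * comb(len(B), above - a_cell)
            / comb(len(A) + len(B), above))

p_obs = hypergeom(a)
lo, hi = max(0, a + b - len(B)), min(len(A), a + b)
# Two-tail probability: all tables no more probable than the one observed.
p_two_tail = sum(hypergeom(i) for i in range(lo, hi + 1)
                 if hypergeom(i) <= p_obs * (1 + 1e-9))
print(mdn, (a, b, c, d), round(p_two_tail, 4))
# prints: 9.5 (5, 2, 2, 5) 0.2861
```

The exact two-tail probability is well above .05, which agrees with the conclusion drawn from Table N.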

The hypothesis tested is that the median is the same for both populations. 
Since the samples are likely to be small in making this test, exact probabilities 
should be obtained or Table N should be used. If a one-tail test is wanted, 
then a more exact P should be estimated and this P divided by 2. 

Median Test with More Than Two Samples. Suppose that we have three 
samples, each from its own treatment or set of conditions. We want to 
test the homogeneity of their central values. For example, consider the 
three samples in Table 11.14. 


TABLE 11.14. APPLICATION OF THE MEDIAN TEST TO MORE THAN TWO SAMPLES 
Samples 


Contingency Table 


Mdn = 9.0 


The median of all 18 observations is 9.0. Since we have some 9's in the 
lists, we cannot make the point of dichotomy at exactly 9. In such a 
situation we make it as near the median as we can. Let it be the point 9.5. We 
then obtain the contingency table, as in Table 11.14. The chi square 
computed from this 2 × 3 table is 7.82. With 2 df, we find this to be significant 
at approximately the .02 point. We reject the null hypothesis, and we 
may then test for significance the differences between pairs, if we wish. 
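The same procedure can be sketched for any number of samples. The example below uses the three samples of Exercise 11 at the end of this chapter as input (an assumption made for illustration; these are not the data of Table 11.14). In these data no score falls exactly at the pooled median, so the median itself serves as the point of dichotomy.

```python
from statistics import median

samples = {
    "A": [9, 7, 2, 10, 5, 8],
    "B": [10, 15, 12, 11, 16, 6],
    "C": [18, 15, 14, 20, 10, 13],
}

pooled = [x for xs in samples.values() for x in xs]
cut = median(pooled)                      # point of dichotomy

above = {name: sum(x > cut for x in xs) for name, xs in samples.items()}
below = {name: len(xs) - above[name] for name, xs in samples.items()}
n_total = len(pooled)
tot_above = sum(above.values())
tot_below = n_total - tot_above

# Ordinary chi square for the 2 x k table of counts above and below the cut.
chi_sq = 0.0
for name, xs in samples.items():
    for obs, row_total in ((above[name], tot_above), (below[name], tot_below)):
        expected = row_total * len(xs) / n_total
        chi_sq += (obs - expected) ** 2 / expected
df = len(samples) - 1
print(cut, above, round(chi_sq, 2), df)
```

With these data the chi square comes to about 9.33 with 2 df, which is close to the value listed for Exercise 11a in the Answers.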

The Sign-rank Test of Differences. A pair of test methods that have to 
do with ranking of observations in two samples will be described next. They 
may be attributed to Wilcoxon.¹ In the first of these methods, we rank the 
differences, or changes, according to absolute size. In the second, we rank all 
the measurements in one combined group in terms of size. In the former we 
need paired observations; in the latter we do not. 


TABLE 11.15. APPLICATION OF THE SIGN-RANK TEST OF DIFFERENCES, USING THE 
KNEE-JERK DATA 

 T*    R    T − R    Rank of absolute    Ranks with 
                     difference          minority sign 
 19    14    +5          4.5 
 19    19     0 
 26    30    −4          3                   −3 
 15     7    +8          7.5 
 18    13    +5          4.5 
 30    20   +10          9 
 18    17    +1          1.5 
 30    29    +1          1.5 
 26    18    +8          7.5 
 28    21    +7          6 
                                          T = −3 

* T = knee-jerk score under tension. R = score under relaxation. 


Let us use as an illustration of the sign-rank test of differences the same 
data to which we applied the sign test in Table 11.12. The 10 pairs of 
knee-jerk measurements under tensed and relaxed conditions are repeated for 
convenience in Table 11.15. Here the numerical differences, with algebraic 
signs, are also listed. Unlike the sign test, this one utilizes the additional 
information of sizes of differences. As in the sign test, however, we cannot use zero 
differences, since the differences must be classified according to algebraic sign. 

Having the differences with their algebraic signs, we first forget the signs 
and rank the differences according to size only, giving the smallest difference 
a rank of 1. There are two differences of 1. We do not know which one to 
call rank 1 and which rank 2, and so we give them each an average rank of 1.5. 
The next smallest difference is 4, which is given a rank of 3, and so on until 
all nonzero differences are ranked. 


¹ Wilcoxon, F. Some Rapid Approximate Statistical Procedures. Stamford, Conn.: 
American Cyanamid Co., 1949. 

Next, we consider the algebraic signs of the differences. We single out all 
differences whose sign is in the minority. If there are fewer negative than 
positive signs, as here, we select all ranks corresponding to the differences 
having that sign. There is only one negative difference in Table 11.15. We 
put this rank with negative sign in the last column. We sum this column to 
give a statistic T. 

The hypothesis tested is that the differences are symmetrically distributed 
about a mean difference of zero. If this were true, T would coincide with the 
mean of such sums of randomly selected ranks, T̄, which is also half the 
sum of N successive ranks, and which would be given by the formula 

$$\bar{T} = \frac{N(N+1)}{4} \qquad \text{(Mean of sums of ranks)} \tag{11.15}$$

The deviation obtained is T − T̄. Wilcoxon has supplied a table giving 
the deviations significant at the .05, .02, and .01 levels (see Table P, Appendix 
B). Reference to Table P indicates that the obtained T of −3 (the algebraic 
sign does not matter in the use of the table) is significant at the .02 level (a 
two-tail test), when we have nine differences involved. 
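The ranking steps described above can be sketched as follows. The helper `average_ranks` is a hypothetical name introduced for illustration; it gives tied values the average of the ranks they occupy. The significance decision itself still rests on Table P, which this sketch does not reproduce.

```python
def average_ranks(values):
    """Rank from smallest (1) upward; tied values share their average rank."""
    s = sorted(values)
    # s.index(v) + 1 is the first rank occupied by v; ties extend upward from it.
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]

T_scores = [19, 19, 26, 15, 18, 30, 18, 30, 26, 28]   # tension
R_scores = [14, 19, 30, 7, 13, 20, 17, 29, 18, 21]    # relaxation

diffs = [t - r for t, r in zip(T_scores, R_scores) if t != r]   # drop zero differences
ranks = average_ranks([abs(d) for d in diffs])

neg_sum = sum(r for r, d in zip(ranks, diffs) if d < 0)
pos_sum = sum(r for r, d in zip(ranks, diffs) if d > 0)
T_stat = min(neg_sum, pos_sum)          # sum of ranks carrying the minority sign
N = len(diffs)
T_mean = N * (N + 1) / 4                # mean of sums of ranks

print(T_stat, T_mean, T_mean - T_stat)
# prints: 3.0 22.5 19.5
```

The two rank sums always add to N(N + 1)/2, here 45, which is a convenient check on the ranking.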

For an N greater than 25, the T values significant at various probabilities 
can be found by using the equations: 

$$T_{.05} = \bar{T} - 1.960\sqrt{\frac{(2N+1)\bar{T}}{6}}$$

$$T_{.02} = \bar{T} - 2.326\sqrt{\frac{(2N+1)\bar{T}}{6}} \qquad \text{(T statistics significant at various levels)} \tag{11.16}$$

$$T_{.01} = \bar{T} - 2.576\sqrt{\frac{(2N+1)\bar{T}}{6}}$$

where T̄ = mean of the sums of ranks and the radical expression is the 
standard deviation of the sampling distribution of T. 
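A minimal sketch of the large-sample computation follows. It assumes the usual standard deviation of the signed-rank sum, written here as the square root of N(N + 1)(2N + 1)/24; the function name `critical_T` and the choice of N = 30 are illustrative only.

```python
from math import sqrt

def critical_T(N):
    """Largest T significant at the .05, .02, and .01 levels (two-tail), large N."""
    T_mean = N * (N + 1) / 4
    sd = sqrt(N * (N + 1) * (2 * N + 1) / 24)   # SD of the sampling distribution of T
    z_values = {".05": 1.960, ".02": 2.326, ".01": 2.576}
    return {level: T_mean - z * sd for level, z in z_values.items()}

for level, value in critical_T(30).items():
    print(level, round(value, 1))
```

An obtained T at or below the value printed for a level would be regarded as significant at that level.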

It will be seen that the outcome of this test agrees with that from the sign 
test for the same data. There will not always be this much agreement, and 
when there is not, the result of the sign-rank-difference test would be regarded 
as more dependable, since it rests upon more information. 

The Composite-rank Method. When the observations are not paired so 
that we can operate with differences, a ranking of all single observations is the 
basis for the test next to be described. If two samples came from the same 
population, when the observations are put in one composite ranking, the 
sums of the ranks belonging to the two samples should be equal. The test 
here is of the departure of the sums of ranks from equality. 

Consider the two samples of seven cases each, in Table 11.16, obtained 
under conditions A and B. We assign the lowest ranks to the lowest values. 
There are two lowest scores of 5, each of which receives a rank of 1.5. The 
score of 6 then receives a rank of 3, and so on, until the highest score of 15 
receives a rank of 14 (which equals N unless there are ties for top place). 

TABLE 11.16. APPLICATION OF THE R TEST OF A DIFFERENCE, BASED UPON THE 
SUM OF RANKS 

     Measurements         Ranks 
       A      B         A       B 
      14      5        13      1.5 
      13      7        12      4 
      10      6         8.5    3 
      12      5        11      1.5 
      15     11        14     10 
       9      8         6.5    5 
       9     10         6.5    8.5 
 Σ                     71.5   33.5 
                       R_a    R_b 

The sums of the ranks for conditions A and B, which we shall call R_a 
and R_b, are 71.5 and 33.5, respectively. The check for these sums is that 
they should sum to N(N + 1)/2, where there are N ranks. In this case, 
71.5 + 33.5 = 105 = N(N + 1)/2. 
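The composite ranking and the check on the sums can be sketched as below. The helper `average_ranks` is a hypothetical name used for illustration; it assigns tied scores the average of the ranks they occupy.

```python
def average_ranks(values):
    """Rank from smallest (1) upward; tied values share their average rank."""
    s = sorted(values)
    return [s.index(v) + 1 + (s.count(v) - 1) / 2 for v in values]

A = [14, 13, 10, 12, 15, 9, 9]
B = [5, 7, 6, 5, 11, 8, 10]

ranks = average_ranks(A + B)          # one composite ranking of all 14 scores
R_a = sum(ranks[:len(A)])
R_b = sum(ranks[len(A):])
N = len(A) + len(B)

print(R_a, R_b, R_a + R_b, N * (N + 1) / 2)
# prints: 71.5 33.5 105.0 105.0
```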

We select the smaller of the two sums, which happens to be R_b in this 
problem, as our sampling statistic. It is distributed about the mean of the sums, 
which is given by formula (11.15) but which will be called R̄. For values of 
N_i (number in each sample) not greater than 20 and for samples of equal size 
(N_a = N_b = N_i), Wilcoxon has tabled values of significant R's. Table Q in 
the Appendix provides those values. With seven replications (N_i = 7), an 
R of 33.5 is significant between the .02 and .01 levels, a bit closer to the .02 
level (two-tail test). 

For the application when N_i exceeds 20, the R's significant at the three 
levels may be computed by the formulas 

$$R_{.05} = \bar{R} - 1.960\sqrt{\frac{N_i\bar{R}}{6}}$$

$$R_{.02} = \bar{R} - 2.326\sqrt{\frac{N_i\bar{R}}{6}} \qquad \text{(Values of statistic R significant at three levels)} \tag{11.17}$$

$$R_{.01} = \bar{R} - 2.576\sqrt{\frac{N_i\bar{R}}{6}}$$

where R̄ = mean of the sums of ranks and the radical expression is the 
standard deviation of the sampling distribution of R. 

The Mann-Whitney U Test. There is a generalization of the R test just 
described to take care of samples of unequal size. For this more general case 
we have the Mann-Whitney U test. The hypothesis being tested is the same 
as for the R test, and also the operations through to the finding of the sums 
of the ranks. Either sum can be treated as statistic U. When N_a and N_b are 
both as large as 8, a z test can be used and can be computed by the formula 

$$z = \frac{2U_i - N_i(N+1)}{\sqrt{\dfrac{N_a N_b (N+1)}{3}}} \qquad \text{(z value for an obtained sum of ranks for a U test)} \tag{11.18}$$

where U_i = one of the sums of ranks 
N_a, N_b = replications in samples A and B 
N = total number of cases = N_a + N_b 
N_i = number of cases corresponding to U_i 

As usual, z is interpreted in terms of the unit normal distribution curve. 
For very small samples, one or both of which is smaller than 8, Mann and 
Whitney provide tables of probabilities. The U test is said to be more 
powerful than the median test. It should not be used if there are too many 
tied ranks. 
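The z computation can be sketched as follows. The function name `u_test_z` and the rank sums in the example are hypothetical values chosen for illustration (samples of 8 and 10 cases whose rank sums total 171, as they must for 18 ranks); they are not data from the text.

```python
from math import sqrt

def u_test_z(U_i, N_i, N_a, N_b):
    """z for a sum of ranks U_i belonging to the sample with N_i cases."""
    N = N_a + N_b
    return (2 * U_i - N_i * (N + 1)) / sqrt(N_a * N_b * (N + 1) / 3)

N_a, N_b = 8, 10
U_a, U_b = 52, 119                      # hypothetical rank sums; 52 + 119 = 18 * 19 / 2

z_a = u_test_z(U_a, N_a, N_a, N_b)
z_b = u_test_z(U_b, N_b, N_a, N_b)
print(round(z_a, 3), round(z_b, 3))
```

Either sum leads to the same z apart from algebraic sign, which is why either one may be treated as the statistic.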

Other Nonparametric Tests. The examples in this section by no means 
exhaust the list of distribution-free statistics. There are others having to do 
with differences in central value of two or more samples. Some are based 
upon other principles than we have seen above, on principles of matching 
and of runs, for example. There are also tests of independence of two 
variables and of significance of correlation. Two of the latter, the rho coefficient 
of correlation and the tau coefficient of Kendall, are both based upon ranks 
and will be mentioned in Chap. 13. For more complete coverage of the 
various nonparametric methods, see Moses, as well as Walker and Lev. 


Exercises 


In each of the following exercises state your inferences and general conclusions in 
connection with each solution. 
1. Compute a chi square for the contingency table in Data 11A. 


DATA 11A. NUMBERS OF TWO GROUPS DIFFERING IN ABILITY WHO PASSED A 
CERTAIN TEST ITEM 


Failed.. 
Both. 


1 Mann, H. B., and Whitney, D. R. On a test of whether one of two random variables is 
stochastically larger than the other. Ann. math. Statist., 1947, 18, 50-60. 

Moses, L. E. Non-parametric statistics for psychological research. Psychol. Bull., 
1952, 49, 122-143; Walker and Lev, op. cit. 

2. Compute a chi square for the contingency table in Data 11B. 


DATA 11B. NUMBER OF PERSONS IN TWO GROUPS, DEPRESSED AND NOT DEPRESSED 
IN TEMPERAMENT, WHO RESPONDED IN EACH OF THREE CATEGORIES TO THE 
QUESTION, “WOULD YOU RATE YOURSELF AS AN IMPULSIVE INDIVIDUAL?” 


Group             Yes     ?     No    Totals 
Depressed          72    45    133     250 
Not depressed     106    35    109     250 
Totals            178    80    242     500 


3. In polling 48 interviewees, we find that 28 favor a certain routing of a freeway. Is it 
likely that this represents a majority vote in the population in the same direction? Use 
chi square, assuming a random sample. 

4. In an experimental group of 15 who were inoculated, two developed a cold within a 
specified time period whereas in a control group of the same size, nine developed a cold. 

a. Determine chi square with and without Yates's correction. 
b. Make a test using Table N. 

5. Make a chi-square test for Data 11B, combining the “Yes” and “?” categories. 
Compare the results with those in Exercise 2. Can you account for the difference? What 
conclusions would be probable if the “?” and “No” categories were combined? 

6. In 13 identical-twin pairs, 10 pairs had two criminals, the remaining pairs having one 
criminal each. In 17 fraternal-twin pairs, three pairs had two criminals, the remaining 
pairs having one criminal each. Set up a contingency table and compute a chi square. 

7. On the application of a certain test before therapy, 25 of an experimental group were 
above the general median score and 15 were below. After therapy, 16 were above the 
median and 24 were below. Eleven were above the median both before and after. Set up a 
contingency table and compute chi square. 

8. The variances from three samples were 142, 117, and 85, with N's of 16, 11, and 21, 
respectively. Apply Bartlett's test of homogeneity of variance. 

9. In three pairs of independent samples, differences between means, M₁ − M₂, equaled 
24, 1.7, and 5.2. The probabilities (one-tail tests) associated with these differences were 
.12, .35, and .015, respectively. What is the probability that such a combination of 
differences could have occurred by chance? 

10. Apply the sign test to the first 15 differences in Table 9.5. 

11. In three samples the observations were: 

A. 9, 7, 2, 10,5, 8 

B. 10, 15, 12, 11, 16, 6 

C. 18, 15, 14, 20, 10, 13 

a. Apply the median test to all three samples. 

b. Apply the median test to all pairs of samples, using the same median as in part a. 

12. Apply the sign-rank-difference test to the same data as were used in Exercise 10. 

13. Apply the composite-rank test to the pairs of distributions given in Exercise 11. 


Answers 
1. χ² = 3.96; df = 1. 
2. χ² = 10.12; df = 2. 
3. χ² = 1.33; df = 1. 
4. a. χ² = 7.03 (without correction); df = 1. 
      χ² = 5.17 (with correction). 
   b. From Table N, p < .05. 
5. χ² = 4.63; df = 1. 
6. χ² = 8.27 (with correction); df = 1. 
7. χ² = 4.26 (without correction); df = 1. 
   χ² = 3.37 (with correction). 
8. B′ = 2.575+; C = 1.0324; B = 2.49; df = 2. 
9. χ² = 14.74; df = 6. 
10. p = 1,471/16,384 = .090 (one-tail test). 
11. a. χ²(A versus B versus C) = 9.34; df = 2. 
    b. χ²(A versus B) = 6.67; df = 1. 
       χ²(A versus C) = 8.67; df = 1. 
       χ²(B versus C) = 3.33; df = 1. 
12. T = 14.5; .01 < p < .02. 
13. R(A versus B) = 25.5; .02 < p < .05 (ΣR = 78.0). 
    R(A versus C) = 21.5; p < .01. 
    R(B versus C) = 31.0; p > .05. 


CHAPTER 12 


INTRODUCTION TO ANALYSIS OF VARIANCE 


It frequently happens in research that we obtain more than two sets of 
measurements on the same experimental variable, each under its own set of 
conditions, and we want to know whether there are any significant differences 
among the sets. We could, of course, pair off two sets at a time, pairing each 
one with every other one, and test the significance of the difference between 
means, or other statistics, in each pair. 

Perhaps the variation of condition has been a qualitative one; for example, 
we have test scores for children from each of five neighboring states, or we 
have simple-reaction-time measurements under four different verbal instruc- 
tions. Every other variable thought to be significantly related in a determin- 
ing way to the experimental variable has been held constant. Perhaps the 
variation is a quantitative one, for example, retention scores obtained after 
different proportions of time spent in memorizing by the anticipation method 
versus the reading method, or arithmetic scores of children who have devoted 
different proportions of class time to drill in number operations versus con- 
crete applications of numbers. 

One practical problem involved in testing for significance of differences is 
the amount of labor involved. Five samples involve 10 pairs; six samples 
involve 15 pairs; 10 samples involve 45 pairs; and so on. There is a 
possibility that none of the differences between pairs would prove to be significant. 
In meeting this situation, it would be desirable to have some over-all test of 
the several samples simultaneously to tell us whether any of the differences 
were significant. If the answer is “Yes,” we can then examine pairs to see 
just where the significant differences are. If the answer is “No,” our search is 
over without further ado. 
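The counts of pairs quoted above follow from the number of ways of choosing two samples out of k, which is k(k − 1)/2. A brief sketch (the variable name `pairs` is illustrative):

```python
from math import comb

# Number of distinct pairs of samples for k = 5, 6, and 10 samples.
pairs = {k: comb(k, 2) for k in (5, 6, 10)}
print(pairs)
# prints: {5: 10, 6: 15, 10: 45}
```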

There are more important logical and statistical reasons for wanting a single 
composite test. If we happened to have as many as a hundred differences to 
be tested, and if we found one of them significant at the .01 level and 
approximately five of them significant at the .05 level, we should actually conclude 
that none of the differences is significant. We could even have a few more 
than these meeting the significance standards due to chance. We should 
suspect even the large differences of being due to chance unless we have an 
excess number of them. A simultaneous test should be of such a nature that 
we can conclude whether the whole distribution of obtained sampling 
statistics could have happened by chance. 

There is still another statistical reason for wanting to treat the data 
together. If we tested each pair separately, we would use as an estimate of 
the population variance only the data from the two samples involved. If 
we make the null hypothesis apply to all the samples—that they all arose by 
random sampling from the same population—we could use all the data from 
which to make a much more stable estimate of the population variance. We 
should have to assume, of course, that the variances from the different sam- 
ples are homogeneous. To satisfy ourselves on this point we could apply 
Bartlett's test, which was described in Chap. 11. 

Although we saw in the preceding chapter some attention given to these 
problems of composite tests of significance, the methods described there have 
limited application. The reason is that when we can make the appropriate 
assumptions there are more powerful parametric tests available. These 
come under the general heading of analysis of variance. 


ANALYSIS IN A ONE-WAY CLASSIFICATION PROBLEM 


Consider again the case in which we have several samples of the same 
general character and we want to determine whether there are any significant 
differences among the means. The basic principle of such a test is to deter- 
mine whether the sample means vary further from the population mean than 
we should expect, as compared with the variations of single cases from the 
same mean. 

Two Estimates of Population Variance. The amount of variation of single 
cases from the population mean is indicated by the statistic s², which is our 
estimate of the population variance, or parameter σ². The variation of 
randomly sampled means about the population mean is indicated by the SE 
of the mean squared, which is denoted by σ²_M and is estimated by the ratio 
s²/n, where n is the size of each sample.* If we multiply this ratio by n we 
obtain s², the estimate of the population variance. 

In other words, we have a way of estimating the population variance from 
the variance among means. If there is no significant variation among the 
means, if they arose by random sampling from the same population (or from 
populations with equal means), the population variance estimated from them 
should be the same as that estimated from the single observations. The test 
for determining the significance of the differences between two variances is 
the F test, which was described in Chap. 10. With appropriate df applied to 
the two variances being compared, we can interpret F as being significant 
or not. 


* In connection with analysis of variance we shall generally use n to stand for the number 
of cases in a subsample and N to stand for the number of cases in all subsamples in the 
problem combined. 

Between Sum of Squares and Between Variance. Our attention will be 
directed next to the operations by which the two estimates of population 
variance are achieved, one from the means and one from the single 
observations. We have already seen that there is a basis for estimating the 
population variance from the means. The computational steps will now be 
described. 

Suppose that we have k samples, or sets, of n cases each, where n is a 
constant. For each of the k means we should have the deviation 

$$d = M_s - M_t \qquad \text{(Deviation of a set mean from the grand mean)} \tag{12.1}$$

where M_s = mean of a set, where sets vary from 1 to k, and M_t = grand 
mean; mean of means; also mean of observations in all sets combined. 

If we squared all the deviations d and summed them, we should be on the 
way to finding the variance of the means about the population mean. This 
variance is analogous to a variance error of the mean, which is the square of 
the SE of the mean. This variance is not exactly what we want. We want 
an estimate of the variance of individual cases about the population mean, not 
the variance of the means. 

We ordinarily compute a variance from a sum of squares. The sum of 
squares that we want is given by nΣd². This can be made more reasonable 
by saying that each d value is shared by all n cases in the set from which it 
comes. It is as if we gave all the cases in that set the same deviation value. 
In estimating the variance of individuals from the mean we need as many 
deviations as there are persons. Thus, the expression nΣd² is an estimate of 
the sum of squares of deviations of all individuals from the population mean. 
Since it is derived from the means, it is called the between sum of squares. 

A variance is often called a mean square; it is a mean of the squares, which 
implies division of the sum of squares by the number of things squared. In 
estimating population variance, however, to overcome bias we divide instead 
by degrees of freedom. There are k deviations d involved, from which we 
have k − 1 degrees of freedom. One degree of freedom is lost in using the 
computed grand mean M_t. The between variance is therefore computed by 
the equation 

$$V_b = \frac{n\sum d^2}{k - 1} \qquad \text{(Between variance or mean square)} \tag{12.2}$$

Within Sum of Squares and Within Variance. If we may assume that the 
variances within the different samples are equal, except for random 
fluctuations, we may combine the sums of squares from all sets in order to obtain 
from this source an estimate of the population variance. As we combine 
sums of squares, we also combine degrees of freedom by which to divide the 
sum of squares. In each sample the number of df is n − 1. In k samples 
combined we have k(n − 1) df. This can also be expressed as (N − k) df, 
since N = kn. 

In terms of a formula, the within variance, or within mean square, is 
estimated from the within sum of squares by the equation 

$$V_w = \frac{\sum\sum x_s^2}{k(n - 1)} = \frac{\sum\sum x_s^2}{N - k} \qquad \text{(Within variance or mean square)} \tag{12.3}$$

where x_s = a deviation of an observation from its sample mean and other 
symbols are as defined previously. 

TABLE 12.1. WORK SHEET FOR THE ANALYSIS OF VARIANCE IN FOUR SETS OF 
MEASUREMENTS ON THE GALTON BAR 

The Measurements (X) 

          Set I    Set II   Set III   Set IV 
           114      119      112       117 
           115      120      116       117 
           111      119      116       114 
           110      116      115       112 
           112      116      112       117 
ΣX_s       562      590      571       577      2,300  ΣX 
M_s        112.4    118.0    114.2     115.4    115.0  M_t 

Deviations within Sets (x_s) 

          +1.6     +1.0     −2.2      +1.6 
          +2.6     +2.0     +1.8      +1.6 
          −1.4     +1.0     +1.8      −1.4 
          −2.4     −2.0     +0.8      −3.4 
          −0.4     −2.0     −2.2      +1.6 

Squares of Deviations within Sets (x_s²) 

           2.56     1.00     4.84      2.56 
           6.76     4.00     3.24      2.56 
           1.96     1.00     3.24      1.96 
           5.76     4.00     0.64     11.56 
           0.16     4.00     4.84      2.56 
Σx_s²     17.20    14.00    16.80     21.20     69.20  ΣΣx_s² 

Deviations of Set Means from Grand Mean (d) 

d         −2.6     +3.0     −0.8      +0.4 
d²         6.76     9.00     0.64      0.16     16.56  Σd² 
nd²       33.80    45.00     3.20      0.80     82.80  nΣd² 


The Solution of an Analysis-of-variance Problem. In Table 12.1, we have 
four sets of observations made by the same individual on the Galton bar. 
With a constant horizontal line of 115 mm., the subject adjusted another line 
to seem equal to it. The four sets were obtained under four different 
arrangements of conditions under which the adjustments were made. Is it likely 
that the observations all came by random sampling from the same general 
“population” of adjustments, or were there systematic differences among sets 
sufficient to say that the data are really not homogeneous? The following 
steps are followed in the solution of the type in Table 12.1: 


Step 1. Compute sums and means of the sets; also the grand total ΣX and 
the grand mean M_t. 

Step 2. For every set, compute the deviations from the set mean M_s. These 
are equal to (X − M_s) and are called x_s. 

Step 3. Square the deviations within sets to find each x_s². Sum these to 
obtain ΣΣx_s², the sum of the squares of deviations within sets. 

Step 4. For each set, compute d, which equals (M_s − M_t). 

Step 5. Square each d, and find nΣd². 


With these calculations completed (see Table 12.1), we have the values we 
need for formulas (12.2) and (12.3). The ΣΣx_s² is 69.20, and the nΣd² is 82.80. 
Dividing these by the appropriate degrees of freedom, we obtain the variances. 
For this purpose, we set up Table 12.2. Listing first the degrees of freedom 
and sums of squared deviations for “between sets” and dividing, we obtain 
27.60 as the variance estimated from the d's. For the corresponding values 
for “within sets,” we find 4.325 as the variance estimated from the x_s's. The 
F ratio is 27.6/4.325, which equals 6.38. The between variance is over six 
times as great as the within variance. 


TABLE 12.2. THE TOTAL VARIANCE IN THE GALTON-BAR DATA SUBDIVIDED INTO 
TWO COMPONENTS 

Components       Sum of squares    Degrees of freedom    Variance 
Between sets          82.80                 3              27.60 
Within sets           69.20                16               4.325 
Total                152.00                19 

F = 27.6/4.325 = 6.38 
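Steps 1 through 5 and the division by degrees of freedom can be retraced with a short sketch using the measurements of Table 12.1. The variable names are illustrative assumptions; the arithmetic follows formulas (12.2) and (12.3).

```python
sets = {
    "I":   [114, 115, 111, 110, 112],
    "II":  [119, 120, 119, 116, 116],
    "III": [112, 116, 116, 115, 112],
    "IV":  [117, 117, 114, 112, 117],
}
k = len(sets)                 # number of sets
n = len(sets["I"])            # cases per set (constant)
N = k * n                     # all cases combined

grand_mean = sum(sum(xs) for xs in sets.values()) / N
set_means = {name: sum(xs) / n for name, xs in sets.items()}

# Between sum of squares: n times the sum of squared deviations of set means.
ss_between = n * sum((m - grand_mean) ** 2 for m in set_means.values())
# Within sum of squares: squared deviations of scores from their own set mean.
ss_within = sum((x - set_means[name]) ** 2
                for name, xs in sets.items() for x in xs)

v_between = ss_between / (k - 1)
v_within = ss_within / (N - k)
F = v_between / v_within
print(round(ss_between, 2), round(ss_within, 2),
      round(v_between, 2), round(v_within, 3), round(F, 2))
# prints: 82.8 69.2 27.6 4.325 6.38
```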

Interpretation of the F Ratio. The significance of an F is determined by 
reference to Snedecor’s table (Table F, Appendix B). In using this table, we 
have to consider the two different df values. For the numerator of the F 
ratio (usually the larger variance), we look for the df₁ at the head of a column. 
For the denominator of F we look for the df₂ at the left of a row. In our 
illustrative problem, there is a df₁ of 3 at the head of a column, but there is no 
df₂ of 16 at the left of any row. We must interpolate between the rows with 
headings of 14 and 17. Linear interpolation will usually yield a decision 
regarding significance level. 

By interpolation we find that an F of 3.24 with df of 3 and 16 is significant 
at the .05 point and an F of 5.29 is significant at the .01 point. Our obtained 
F is greater than that for the .01 point, and so it may be regarded as very 
significant. 

Some Checks. It will be noted in Table 12.2 that we have recorded the 
total sum of squares and the number of df for the same. These have been 
found by summing the components in both instances, i.e., component sums 
of squares and component df. The total sum of squares is a composite of 
two independent contributors—that derived from differences between means 
and that derived from differences within sets. In both instances, of course, 
the “differences” are expressed as deviations from the respective means. 

If we were to pool all the sets, the deviation of each measurement from the 
grand mean M_t is itself a composite of two components. We can say that 

$$X - M_t = (X - M_s) + (M_s - M_t)$$

or that 

$$x = x_s + d_s$$

where the subscript s indicates that the value or statistic belongs to a 
particular set and M_t is the grand mean. Since x is a composite of two independent 
components, the sum of squares of x is a simple sum of the two sums of squares 
of the components x_s and d. In equation form, 

$$\sum x^2 = \sum\sum x_s^2 + n\sum d_s^2$$

where x = deviation from the grand mean, M_t 
x_s = deviation of X from the mean of a set, M_s 
d_s = deviation of M_s from M_t 

The double summation sign before x_s² indicates that the within-set deviations 
are squared and summed for each set, then these sums are summed over all 
sets. 

If Σx² is computed from the complete data, it can be used to check the 
computed values of the between and within sums of squares; it should equal 
the sum of the two. The number of df to be associated with Σx² is N − 1, 
and this should equal the sum of the two different df values for between and 
within sums of squares. In Table 12.2 we find this check satisfied. 
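The check can be sketched directly from the measurements of Table 12.1: the total sum of squares about the grand mean is computed on its own and compared with the sum of the two components. Variable names are illustrative assumptions.

```python
sets = {
    "I":   [114, 115, 111, 110, 112],
    "II":  [119, 120, 119, 116, 116],
    "III": [112, 116, 116, 115, 112],
    "IV":  [117, 117, 114, 112, 117],
}
all_x = [x for xs in sets.values() for x in xs]
N, k = len(all_x), len(sets)
grand_mean = sum(all_x) / N

# Total sum of squares, computed from the complete data.
ss_total = sum((x - grand_mean) ** 2 for x in all_x)

# The two components, computed separately.
ss_within = sum((x - sum(xs) / len(xs)) ** 2 for xs in sets.values() for x in xs)
ss_between = sum(len(xs) * (sum(xs) / len(xs) - grand_mean) ** 2
                 for xs in sets.values())

print(round(ss_total, 2), round(ss_between + ss_within, 2))   # prints: 152.0 152.0
print(N - 1, (k - 1) + (N - k))                               # prints: 19 19
```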

Formation of the F Ratio. In analysis of variance generally, the numerator 
source of the variations involved in this term. It is also sometimes called the 
residual term, since its source is all that is left over after other sources have 
been accounted for. 

It will almost always happen that the numerator term is larger, and F is 
therefore greater than 1.0. We are thus dealing with the right-hand tail of 
the F distribution in our interpretation of F. We have a one-tail test. 
Should F on rare occasion turn out to be less than 1.0, the conclusion is 
merely that we accept the null hypothesis. There is no need to consult the 
table of F for this kind of outcome. 

Making t Tests Following an F Test. A significant F tells us that there are 
nonchance variations among means somewhere in the list of sets; we do not 
know how many or which ones are significantly different. As a group they 
could not have arisen from a homogeneous list of samples. Further exami- 
nation would be needed to tell us where the significant differences are and 
what sources in the form of experimental variation have probably determined 
them. Conclusions concerning the last point, of course, go beyond statistical 
decisions, but the latter do or do not call for the effort to find such conclusions. 

There has been some difference of opinion as to how to interpret t tests 
made following an F test. If F is insignificant, of course, we should not apply 
any tests. Acceptance of the null hypothesis on the basis of an F test auto- 
matically accepts the null hypothesis for all pairs of means in the list, includ- 
ing the pairs with the largest differences. 

In the illustrative problem, the F ratio was significant beyond the .01 point.
Are all the interpair differences significant? Probably not, for the differences
range from about 1 for the difference $M_4 - M_3$ to about 6 for the difference
$M_2 - M_1$.

We could proceed to apply Fisher's formula for t, given as formula (10.5),
to each pair of means. In doing so, we assume the null hypothesis for each 
pair as we test it. We could save ourselves unnecessary work by being 
judicious in starting the tests. For example, if F is just barely significant 
at the .05 point, we might begin by testing the largest difference first and 
proceed with other differences in order of size until we come to one that is 
insignificant. If remaining differences are smaller, we should not need to 
test them. This would be safe, particularly if the samples have similar 
variances. If F is significant well beyond the .01 point, we might begin with 
the smallest difference and work up to the pair of means for which we find a 
significant difference, assuming that all differences as large or larger are also 
significant. 

If the variances within sets are quite uniform (this might be established to
our satisfaction by making Bartlett's test), we can save ourselves much
additional work. The first work-saving step would be to use the within variance
as our estimate of the population variance to apply to all pairs of means.
This gives us a more stable estimate of population variance and only one
SE of a difference to compute. The latter is given by the formula

$$\sigma_{d_M} = \sqrt{\frac{2V_w}{n}}$$ (SE of a difference between means, from within variance) (12.4)

where $V_w$ is the within variance and $n$ is the number of cases in each set.

The next work-saving step is to find what differences would be significant
at the .05 and .01 levels. We have 16 df, since we used all the df available
within all sets. We find from Table D that the t's significant at the .05 and
.01 levels with 16 df are 2.12 and 2.92, respectively. The SE of a difference,
by formula (12.4), is found to be 1.32. Computing the products $t_{.05}\sigma_{d_M}$ and
$t_{.01}\sigma_{d_M}$, we find that differences of 2.80 and 3.85 are significant at these two
levels. The four means found in Table 12.1 are 112.4, 118.0, 114.2, and 115.4.
Of the six pairs, one is significant beyond the .01 level and two others are
significant beyond the .05 level.
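
The sorting of the six pairs can be sketched in a few lines of Python. This is only an illustrative check under stated assumptions: the SE of 1.32 and the tabled t values are copied from the text, and the names `means`, `se_diff`, and `classify` are hypothetical labels chosen for the sketch.

```python
from itertools import combinations

# Means of the four sets, as quoted in the text from Table 12.1.
means = {"I": 112.4, "II": 118.0, "III": 114.2, "IV": 115.4}

se_diff = 1.32            # SE of a difference, from formula (12.4)
t_05, t_01 = 2.12, 2.92   # tabled t values for 16 df, copied from the text

# Smallest differences that reach each level of significance.
crit_05 = t_05 * se_diff
crit_01 = t_01 * se_diff


def classify(diff):
    """Return the level reached by an absolute difference between two means."""
    if diff >= crit_01:
        return ".01"
    if diff >= crit_05:
        return ".05"
    return "n.s."


# Compare every pair of means against the two critical differences.
results = {}
for (a, mean_a), (b, mean_b) in combinations(means.items(), 2):
    diff = round(abs(mean_a - mean_b), 1)
    results[(a, b)] = (diff, classify(diff))

for pair, (diff, level) in results.items():
    print(pair, diff, level)
```

Because a single SE serves every pair, each comparison reduces to checking a difference against two fixed thresholds.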

From a really rigorous point of view, we are not justified in interpreting
t's found after making an F test as if no test had preceded them. Even when
F is significant, this procedure is somewhat like taking a word or a phrase out
of context and interpreting it so. There is some risk involved. A test that
takes into account the entire picture, however, is rather complex. Tukey
has presented one solution, which, though relatively simple, would take too
much space to report here.¹

The Relation of t to F. When we have only two sets of observations, as
when we compare two means for significance, we can still make an F test.
The between variance will have associated with it only 1 df.

For this particular situation, when $n_1 = n_2$, the sum of squares for between
variance is given by the formula

$$n\sum d_b^2 = \frac{n(M_1 - M_2)^2}{2}$$ (Sum of squares between two means, samples of equal size) (12.5)

To illustrate, let us take the largest difference between means in Table 12.1.
The two means are 112.4 and 118.0, and their difference is 5.6. Applying
formula (12.5), we find 78.4 for the sum of squares. The within sum of
squares is 31.2, the sum of 17.2 and 14.0, from Table 12.1. With 1 df for the
between sum of squares, the between variance is 78.4. With 8 df within the
sets, the within variance is 31.2/8 = 3.9. The F ratio is 78.4/3.9 = 20.10,
which is well beyond the .01 point.

It has been proved that, with 1 df for the between variance, $F = t^2$ for the
same difference. In this problem, then, $t = \sqrt{20.10} = 4.48$. If we compute
t by means of formula (10.5), we arrive at the same value.
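
The identity can be checked numerically. The sketch below is an illustration only: it uses the coded scores of Sets I and II from Table 12.3 (original values minus 110), and it assumes that formula (10.5) is the usual pooled-variance t for two independent samples. Variable names are hypothetical.

```python
import math

# Coded scores (original minus 110) for the two sets with the largest difference.
set_1 = [4, 5, 1, 0, 2]     # Set I, mean 2.4 (112.4 before coding)
set_2 = [9, 10, 9, 6, 6]    # Set II, mean 8.0 (118.0 before coding)

n = len(set_1)              # equal sample sizes are assumed
m1 = sum(set_1) / n
m2 = sum(set_2) / n

# Formula (12.5): between sum of squares for two means, equal n.
ss_between = n * (m1 - m2) ** 2 / 2

# Within sum of squares pooled over both sets, with 2n - 2 df.
ss_within = sum((x - m1) ** 2 for x in set_1) + sum((x - m2) ** 2 for x in set_2)
var_within = ss_within / (2 * n - 2)

# F has 1 df in the numerator, so the between variance equals ss_between.
f_ratio = ss_between / var_within

# Pooled-variance t for the same two means (assumed form of formula 10.5).
t = (m2 - m1) / math.sqrt(var_within * (1 / n + 1 / n))

print(round(f_ratio, 2), round(t, 2), round(t ** 2, 2))
```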

¹ Tukey, J. W. Comparing individual means in the analysis of variance. Biometrics,
1949, 5, 99-114.

Computation of Variances from Original Measurements. Just as we can
compute standard deviations, and so variances, from original measurements
without computing each deviation from the mean [see formula (5.12)], so we
can calculate the necessary constants for an analysis of variance. Such an
approach requires us to square the original measurements. With a good
calculating machine available, this is no large order, but with only pencil and
paper it may amount to considerable labor.


TABLE 12.3. SOLUTION OF AN ANALYSIS OF VARIANCE FROM ORIGINAL MEASUREMENTS
(Without Determining Deviations from Means)

Measurements (Reduced) (X')

                Set I   Set II   Set III   Set IV
                  4        9        2         7
                  5       10        6         7
                  1        9        6         4
                  0        6        5         2
                  2        6        2         7
(ΣX')_s          12       40       21        27       100 = ΣX'
M_s             2.4      8.0      4.2       5.4       5.0 = M_t
(ΣX')²_s        144    1,600      441       729     2,914 = Σ(ΣX')²_s

Squared Measurements (X'²)

                 16       81        4        49
                 25      100       36        49
                  1       81       36        16
                  0       36       25         4
                  4       36        4        49
(ΣX'²)_s         46      334      105       167       652 = Σ(ΣX'²)_s


By a process of coding, we can bring the numbers down to small size. 
From each of the three-place numbers in Table 12.1, let us subtract the con- 
stant 110, leaving the remainders shown in the first part of Table 12.3. The 
variances will not be affected in the least by this transformation, for the new 
values, which we shall call X’, maintain the same distances from one another 
and from the means as before coding. 

For the general solution, without knowing deviations x or d, the sums of
squares we need are found by the following procedures. The between sum of
squares is given by the formula

$$n\sum d_b^2 = \frac{\sum(\sum X)_s^2}{n} - \frac{(\sum X)^2}{N}$$ (12.6)

The within sum of squares is given by

$$\sum\sum x_w^2 = \sum X^2 - \frac{\sum(\sum X)_s^2}{n}$$ (12.7)

The total sum of squares is given by

$$\sum x_t^2 = \sum X^2 - \frac{(\sum X)^2}{N}$$ (12.8)


The steps called for in applying these formulas are: 


Step 1. Sum the measurements X for each set, to obtain $(\sum X)_s$ for each set
(see Table 12.3). Sum these values to obtain $\sum X$.

Step 2. Square the sums of the scores to obtain $(\sum X)_s^2$ for each set. Sum
these values to find $\sum(\sum X)_s^2$.

Step 3. Square all measurements to find the $X^2$ values. Sum these values to
obtain $\sum X^2$.


Applying the three formulas, by formula (12.6),

$$n\sum d_b^2 = \frac{2{,}914}{5} - \frac{10{,}000}{20} = 582.8 - 500 = 82.8$$

By formula (12.7),

$$\sum\sum x_w^2 = 652 - \frac{2{,}914}{5} = 652 - 582.8 = 69.2$$

and formula (12.8),

$$\sum x_t^2 = 652 - \frac{10{,}000}{20} = 652 - 500 = 152$$

A check for accuracy of computations is to see that
$n\sum d_b^2 + \sum\sum x_w^2 = \sum x_t^2$.
The check is satisfied, for 82.8 + 69.2 = 152. A comparison of these values
with those in Table 12.2 will show that we have arrived at the very same
sums of squares. From here on, the computation of variances and F is just
the same as before.
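
The three steps and formulas (12.6) to (12.8) can be sketched in Python as follows. This is a minimal illustration, not the author's procedure: the data are the coded scores of Table 12.3, the variable names are hypothetical, and the final F ratio is simply what the arithmetic yields with 3 and 16 df.

```python
# Coded scores (original minus 110) for the four sets of Table 12.3.
sets = {
    "I": [4, 5, 1, 0, 2],
    "II": [9, 10, 9, 6, 6],
    "III": [2, 6, 6, 5, 2],
    "IV": [7, 7, 4, 2, 7],
}

k = len(sets)                                   # number of sets
n = len(sets["I"])                              # cases per set (equal sizes)
N = k * n                                       # total number of cases

# Step 1: sum within each set, then over all sets.
set_sums = {name: sum(values) for name, values in sets.items()}
sum_x = sum(set_sums.values())

# Step 2: square the set sums and add them.
sum_sq_set_sums = sum(s ** 2 for s in set_sums.values())

# Step 3: square every measurement and add.
sum_x2 = sum(x ** 2 for values in sets.values() for x in values)

correction = sum_x ** 2 / N                     # (sum of X)^2 / N

ss_between = sum_sq_set_sums / n - correction   # formula (12.6)
ss_within = sum_x2 - sum_sq_set_sums / n        # formula (12.7)
ss_total = sum_x2 - correction                  # formula (12.8)

# Variance estimates and F, using k - 1 and N - k degrees of freedom.
var_between = ss_between / (k - 1)
var_within = ss_within / (N - k)
f_ratio = var_between / var_within

print(ss_between, ss_within, ss_total, f_ratio)
```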

When Samples Are of Unequal Size. The procedures described thus
far apply to the special, but not unusual, case in which all samples are of
equal size. Experiments can be planned that way, but sometimes available
data do not fit that specification. With a little modification of the formulas,
we can take care of problems in which n varies.

For the between sum of squares,

$$\sum n_s d_b^2 = \sum n_s (M_s - M_t)^2 = \sum \frac{(\sum X)_s^2}{n_s} - \frac{(\sum X)^2}{N}$$ (Between sum of squares when samples are unequal in size) (12.9)

where $n_s$ = number of cases in a specified set
      $M_s$ = mean of that set
      $M_t$ = mean of all observations

Other symbols are as defined in the preceding formulas and in Table 12.3.
For all expressions involving subscript s the summation is made over k sets.

For the within sum of squares,

$$\sum\sum x_w^2 = \sum X^2 - \sum \frac{(\sum X)_s^2}{n_s}$$ (Within sum of squares when samples are unequal in size) (12.10)

where the symbols mean the same as in formula (12.9).

For the total sum of squares the formula is the same as when we have
samples of equal size; hence formula (12.8) will apply for the general case.
The degrees of freedom are the same as in the case of equal n's for the
total and between sums of squares. The df for within sum of squares equal
$\sum(n_s - 1)$.
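
A short Python sketch may help with the unequal-size case. The three sets below are invented purely for illustration (they are not from the text), and the variable names are hypothetical. The sketch computes the between sum of squares both from deviations of set means and from raw sums, so that the two forms can be compared.

```python
# Hypothetical sets of unequal size, made up for this illustration only.
sets = {
    "A": [3, 5, 4],
    "B": [6, 8, 7, 9],
    "C": [2, 4],
}

N = sum(len(values) for values in sets.values())
sum_x = sum(sum(values) for values in sets.values())
sum_x2 = sum(x ** 2 for values in sets.values() for x in values)
m_t = sum_x / N                                  # mean of all observations

# Between sum of squares from deviations of set means, weighted by n_s.
ss_between_dev = sum(
    len(values) * (sum(values) / len(values) - m_t) ** 2
    for values in sets.values()
)

# The same quantity from raw sums: each squared set sum divided by its own n_s.
sum_sq_over_n = sum(sum(values) ** 2 / len(values) for values in sets.values())
ss_between_raw = sum_sq_over_n - sum_x ** 2 / N

ss_within = sum_x2 - sum_sq_over_n               # within sum of squares
ss_total = sum_x2 - sum_x ** 2 / N               # same as the equal-n case

df_between = len(sets) - 1
df_within = sum(len(values) - 1 for values in sets.values())

print(ss_between_dev, ss_between_raw, ss_within, ss_total, df_within)
```

The only change from the equal-size case is that each set's squared sum is divided by that set's own number of cases before the quotients are added.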


ANALYSIS IN A TWO-WAY CLASSIFICATION PROBLEM


In the preceding kind of problem the sets of data were differentiated on 
the basis of only one experimental variation. There was only one principle 
of classification, one reason for segregating data into sets. 

In a two-way classification, there are two distinct bases of classification. 
Two experimental conditions are allowed to vary from trial to trial. There
may be several trials or replications under each combination of conditions. 
In the psychological laboratory a study of different artificial airfield landing 
strips, each with a different pattern of markings, may be viewed through a 
diffusion screen to simulate vision through fog, each at different levels of 
opaqueness. In an educational problem, four methods of teaching a certain 
geometric concept may be applied by five different teachers, each one using 
every one of the four methods. There would therefore be 20 combinations 
of teacher and method, and let us suppose that an equal number of randomly 
chosen pupils receive learning scores under each combination. 

Tabulation of Data in a Two-way Classification Problem. For an illus- 
tration of the procedure in this type of problem, we will assume an experi- 
ment on the relation of scores on a certain psychomotor test to the size of a 
target at which the examinee must aim. In conducting the experiment 
it is convenient to use three testing machines simultaneously in order to 
reduce the testing time. It is known that there are individual differences 
between machines, in this test, to the extent that it would be risky to attach 
one target size to one machine only throughout the tests. Machine differ- 
ences might make it appear that there were differences attributable to target 
differences or might by chance negate those differences. The target sizes 
were therefore combined with the machines systematically. There were 
therefore 12 target-machine combinations with five observed scores obtained 
with each combination. The scores (which are entirely fictitious for the 
sake of a good illustration) are tabulated in Table 12.4. This arrangement is 
typical and convenient for the operations of analysis of variance. The sums 
and means, as given, are also needed in the variance solution.

The Sources of Variance in a Two-way Classification Problem. We
could, if we chose, proceed to perform an analysis of variance based upon
the model of the one-way classification problem as already demonstrated.


TABLE 12.4. SCORES OF 60 STUDENTS EARNED ON THREE DIFFERENT MACHINES OF
A PSYCHOMOTOR TEST, EACH WITH THE TARGET SIZE VARIED IN FOUR STEPS

                            Machine
Target size              1      2      3      Sums for       Means for
                                              target size    target size
                         6      4      4
                         4      1      2
     A                   2      5      2
                         6      2      1
                         2      3      1
     Σ                  20     15     10          45
     M                   4      3      2                          3

                         8      6      3
                         3      6      1
     B                   7      2      1
                         5      3      2
                         2      8      3
     Σ                  25     25     10          60
     M                   5      5      2                          4

                         7      9      6
                         6      4      4
     C                   9      8      3
                         8      4      8
                         5      5      4
     Σ                  35     30     25          90
     M                   7      6      5                          6

                         9      7      6
                         6      8      5
     D                   8      4      7
                         8      7      9
                         9      4      8
     Σ                  40     30     35         105
     M                   8      6      7                          7

Sums for machines      120    100     80         300
Means for machines       6      5      4                          5

That is, we could take the 12 sets as if they represented categories based 
upon a single principle and test the 12 means collectively to see whether 
they could have arisen by random sampling from the same population. 
We shall see later what kind of answer could be obtained by this approach, 
but let us first see what is logically wrong with this kind of solution here. 
Suppose we did carry through the solution proposed and found an F
ratio that indicated significance beyond the .01 point. We should not
know whether this was due primarily or solely to the differences between 
targets or to the differences between machines, or to both possible sources. 
Suppose, on the contrary, the F ratio indicated no significant differences 
among sets. We should not be sure that one of the experimental variations, 
perhaps target size, were not actually producing real variations that were 
either covered over or counteracted by the effects of the other experimental 
variations. We should have what is called a confounding of effects. We 
need some method that will segregate the variations associated with each 
of the experimental variables so that any significant differences at all will 
have a chance to emerge in the F test and so that we shall know to which 
source to attribute any significant differences found. 

Interaction Variance. The procedure about to be described makes possible 
this kind of segregation of the sources of variations. As a result, we can 
then determine whether differences among means owe their divergencies 
to target size or to machine differences, or to both. Not only that, when 
there are two possible sources of variations, there is also a possibility of 
what is called interaction variance. 

The phenomenon is well named. Interaction variations are those attrib- 
utable not to either of two influences acting alone but to joint effects of the 
two acting together. If it turned out that the larger the target, the larger 
the scores tended to be, that is one direct and isolable effect. If there are 
systematic machine differences so that among three there is a most difficult
one (yields lower mean scores) and an easiest one (yields higher mean scores),
that is another distinct effect. There may be effects of target size and 
machine over and above these. It is conceivable, but not very probable, 
that one machine, apart from its general difficulty, gains in difficulty by 
virtue of its having one size of target rather than others. It may be the 
coincidence of machine and target size that produces systematic variation 
in one direction from the general mean of scores. This is an example of 
interaction variance. 

Interaction variance might be more reasonably expected in combination 
of teacher and instruction method; of kind of task and method of attack 
by the learner; and of kind of reward when combined with a certain condition 
of motivation. 

It is possible to determine whether there is a significant amount of inter- 
action variance present by making an F test for it as well as for the separate 
main effects. 

The Residual Variance. There are three F tests to make, therefore, in 
place of one. The remaining variance is known as the residual variance, 
that within sets. It supplies the basic, or residual, estimate of variance 
after the three sources of variations have been removed, and it serves as 
the denominator for all three F tests. It is sometimes called an estimate
of the error variance for the reason that it represents the influences of many
unknown and uncontrolled sources. A perfect experiment would presumably
control all contributing factors until within each set of data observed
under a specified combination of conditions there would be no longer any
variations; each observed value in a set would be the same. Most experiments
are so imperfect that there is appreciable error variance.

Estimation of the Variance from Different Sources. Two solutions will 
be described, one using deviations of observed values and of means of sets, 
from the various appropriate means, the other using original measurements 
and means. An attempt is made to summarize the operations in terms of
formulas, as usual, but here the symbolizing of concepts becomes so involved
that formulas may be more confusing than helpful. Some readers may find
it easier to follow the examples as models rather than to apply the formulas.
The system of symbols employed in the formulas is given in Table 12.5.


TABLE 12.5. SYMBOLIC SCHEME FOR THE VALUES IN A TABULATION PREPARATORY
TO ANALYSIS OF VARIANCE IN A TWO-WAY CLASSIFICATION PROBLEM

Let $X_{rk}$ = any one of the cell entries
    $M_{rk}$ = any one of the set means

This table provides only three columns and three rows, but it could be
extended in the directions shown to take care of any number of columns
and rows.
The Solution Based upon Deviations. In what follows, consistent with
the symbols in Table 12.5, a subscript k stands for a particular column
(we might have used c for column, but there would be danger of confusing
it with a particular row, row C), and r stands for a particular row. There
are three columns, 1, 2, and 3, in the psychomotor-test problem, and four
rows, A, B, C, and D. The symbol $X_{rk}$ stands for any one observation in
row r and column k, and $M_{rk}$ stands for a mean of the five observations in a
set described as being in row r and column k. In the following, n stands
for the number of observations within each set; in the illustrative problem
n = 5. The number of rows is symbolized by r and the number of columns
by k. The subscript t refers to the total distribution, all sets combined.
Thus, $M_t$ stands for the mean of the composite, and $x_t$ stands for a deviation
of any X from $M_t$.

The total sum of squares is given by the equation

$$\sum x_t^2 = \sum (X_{rk} - M_t)^2$$ (12.11)

Applied to the data of Table 12.4,

$$\begin{aligned}
\sum x_t^2 &= (6-5)^2 + (4-5)^2 + (4-5)^2 && \text{(from first row of observations in Table 12.4)} \\
&\quad + \cdots \\
&\quad + (9-5)^2 + (4-5)^2 + (8-5)^2 && \text{(from last row of observations in Table 12.4)} \\
&= 1^2 + (-1)^2 + (-1)^2 + \cdots + 4^2 + (-1)^2 + 3^2 \\
&= 374 && \text{(total sum of squares)}
\end{aligned}$$

The sum of squares between rows is given by the equation

$$\sum d_r^2 = nk \sum (M_r - M_t)^2$$ (12.12)

Applied to the same data,

$$\begin{aligned}
\sum d_r^2 &= 5 \times 3\,[(3-5)^2 + (4-5)^2 + (6-5)^2 + (7-5)^2] \\
&= 15\,[(-2)^2 + (-1)^2 + 1^2 + 2^2] \\
&= 15 \times 10 \\
&= 150 && \text{(sum of squares between rows)}
\end{aligned}$$

The sum of squares between columns is given by the equation

$$\sum d_k^2 = nr \sum (M_k - M_t)^2$$ (12.13)

Applied to the data of Table 12.4,

$$\begin{aligned}
\sum d_k^2 &= 5 \times 4\,[(6-5)^2 + (5-5)^2 + (4-5)^2] \\
&= 20\,[1^2 + 0^2 + (-1)^2] \\
&= 20 \times 2 \\
&= 40 && \text{(sum of squares between columns)}
\end{aligned}$$


The interaction variance can be estimated in several ways. Perhaps
the most common way is to derive it from the sum of squares between all
sets, eliminating the sums of squares between columns and between rows.
We already know the last two sums of squares. We proceed next to compute
the sum of squares between sets. The formula is similar to the numerator
of formula (12.2) but with different notation to fit the new system.

$$\sum d_{rk}^2 = n \sum (M_{rk} - M_t)^2$$ (12.14)

The symbol $d_{rk}^2$ refers to a squared difference between any set mean and
the total mean $M_t$. The subscript rk implies that all rows and all columns
are involved. Applied to the illustrative data,

$$\begin{aligned}
\sum d_{rk}^2 &= 5\,[(4-5)^2 + (3-5)^2 + (2-5)^2 && \text{(from first row of means)} \\
&\quad + \cdots \\
&\quad + (8-5)^2 + (6-5)^2 + (7-5)^2] && \text{(from last row of means)} \\
&= 5\,[(-1)^2 + (-2)^2 + (-3)^2 + \cdots + 3^2 + 1^2 + 2^2] \\
&= 5 \times 42 \\
&= 210 && \text{(sum of squares between means of sets)}
\end{aligned}$$

If we remove from the entire sum of squares for the 12 set means the sum
of squares attributable to columns and to rows, we have left the interaction
sum of squares. By formula,

$$\sum d_{r \times k}^2 = \sum d_{rk}^2 - \sum d_k^2 - \sum d_r^2$$ (12.15)

in which $\sum d_{r \times k}^2$ (the subscript reads r times k, for reasons that will be
explained) stands for the interaction sum of squares. For the illustrative
problem,

$$\sum d_{r \times k}^2 = 210 - 40 - 150 = 20 \quad \text{(interaction sum of squares)}$$

Another, more direct, way of deriving interaction sums of squares utilizes
the formula

$$\sum d_{r \times k}^2 = n \sum (M_{rk} - M_r - M_k + M_t)^2$$ (12.16)

in which $M_k$ is the mean of the column in which each particular $M_{rk}$ appears
and $M_r$ the mean of its row. For the illustrative problem,

$$\begin{aligned}
\sum d_{r \times k}^2 &= 5\,[(4-3-6+5)^2 + (3-3-5+5)^2 + \cdots && \text{(from first row of means)} \\
&\quad + \cdots \\
&\quad + (6-7-5+5)^2 + (7-7-4+5)^2] && \text{(from last row of means)} \\
&= 5\,[0^2 + 0^2 + \cdots + (-1)^2 + 1^2] \\
&= 5 \times 4 \\
&= 20 && \text{(interaction sum of squares; alternative solution)}
\end{aligned}$$


The sum of squares within sets is computed by the formula

$$\sum x_w^2 = \sum (X_{rk} - M_{rk})^2$$ (12.17)

This formula, with new symbols, requires the same operations as formula
(12.3) given in connection with the single-classification problem. Applied
to the psychomotor problem,

$$\begin{aligned}
\sum x_w^2 &= (6-4)^2 + (4-4)^2 + (2-4)^2 + (6-4)^2 + (2-4)^2 && \text{(from set A1)} \\
&\quad + \cdots \\
&\quad + (6-7)^2 + (5-7)^2 + (7-7)^2 + (9-7)^2 + (8-7)^2 && \text{(from set D3)} \\
&= 164 && \text{(sum of squares within sets)}
\end{aligned}$$

We can now check the solution of this by deducting all previously computed
sums of squares from the total sum of squares, and we have

374 - 40 - 150 - 20 = 164

We could compute $\sum x_w^2$ by this elimination process without going through
the arduous arithmetic involved in using (12.17), but for checking purposes
it is very desirable to derive all the component sums of squares separately
and then check the results.
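
The whole deviation solution can be sketched in Python. The sketch below is illustrative only: the scores are those of Table 12.4, arranged here so that each target holds three lists (one per machine), and every name in the code is a hypothetical label rather than notation from the text.

```python
# Scores from Table 12.4: for each target size, one list of 5 scores per machine.
scores = {
    "A": [[6, 4, 2, 6, 2], [4, 1, 5, 2, 3], [4, 2, 2, 1, 1]],
    "B": [[8, 3, 7, 5, 2], [6, 6, 2, 3, 8], [3, 1, 1, 2, 3]],
    "C": [[7, 6, 9, 8, 5], [9, 4, 8, 4, 5], [6, 4, 3, 8, 4]],
    "D": [[9, 6, 8, 8, 9], [7, 8, 4, 7, 4], [6, 5, 7, 9, 8]],
}


def mean(values):
    """Arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


rows = list(scores)
r, k, n = len(rows), 3, 5
cells = [(row, j) for row in rows for j in range(k)]

all_x = [x for row, j in cells for x in scores[row][j]]
m_t = mean(all_x)                                            # total mean
m_rk = {(row, j): mean(scores[row][j]) for row, j in cells}  # set means
m_r = {row: mean([x for j in range(k) for x in scores[row][j]]) for row in rows}
m_k = {j: mean([x for row in rows for x in scores[row][j]]) for j in range(k)}

ss_total = sum((x - m_t) ** 2 for x in all_x)                        # (12.11)
ss_rows = n * k * sum((m - m_t) ** 2 for m in m_r.values())          # (12.12)
ss_cols = n * r * sum((m - m_t) ** 2 for m in m_k.values())          # (12.13)
ss_sets = n * sum((m - m_t) ** 2 for m in m_rk.values())             # (12.14)
ss_inter = n * sum(
    (m_rk[row, j] - m_r[row] - m_k[j] + m_t) ** 2 for row, j in cells
)                                                                    # (12.16)
ss_within = sum(
    (x - m_rk[row, j]) ** 2 for row, j in cells for x in scores[row][j]
)                                                                    # (12.17)

print(ss_total, ss_rows, ss_cols, ss_sets, ss_inter, ss_within)
```

Computing each component separately, as here, keeps the additive check available: rows, columns, interaction, and within sets should sum to the total.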

Degrees of Freedom. Before taking the next important step of estimating
population variance from these different sources, we need, as usual,
the degrees of freedom. Starting with the largest source, the total sum of
squares, we have, as usual, (N - 1) df, or 59. This figure is to be subdivided
among the contributing components. The sum of squares among
the means of sets should have allotted to it the number of sets minus 1, or
12 - 1 = 11 df. These 11, in turn, are to be allotted to three sources.
Rows have the number of row observations (row means) minus 1, or
4 - 1 = 3. Columns have, by analogy, 3 - 1 = 2. This leaves 6 out of the 11 for
interaction. These 6 degrees are 3 × 2, the product of the df for rows and
columns, each source taken separately. This is consistent with the idea of
interaction itself, whose contributions to variations may be regarded as the
products of two sources. This is why we use the subscript r × k when
referring to interaction. Having taken care of the special sources of variations,
the remainder, or 59 - 11, gives us the df left for within-sets sums of
squares. This number of df may also be determined directly from a summation
of df within sets. Since there are 12 sets and each contains 4 df,
we have 12 × 4 = 48 df for the residual variance.

In terms of symbolic descriptions, the degrees of freedom may be given 
as follows: 


Source               Degrees of Freedom
Between rows         r - 1
Between columns      k - 1
Interaction          (r - 1)(k - 1)
Within sets          N - rk = rk(n - 1)
Total                N - 1


The F Ratios. We are now ready to estimate the variances and to com- 
pute the F ratios. These are systematically arranged in Table 12.6. There 
are four different estimates of population variance—50.0, 20.0, 3.33, and 
3.42. We compare the first three, since they represent possible special 
contributions resulting from varied experimental conditions, each with the 
fourth. The fourth presumably represents variations of the phenomenon 
measured freed from possible influences of the experimental variations. Do 
the first three differ significantly from the fourth? 


TABLE 12.6. SOURCES OF VARIANCE IN THE PSYCHOMOTOR-TEST DATA ANALYSIS
AND F RATIOS

Source                   Sum of     Degrees of    Estimate of
                         squares    freedom       variance
Target size (T)            150          3            50.0
Machine (M)                 40          2            20.0
Interaction (T × M)         20          6             3.33
Within sets                164         48             3.42

                                                   Required F
                                                 P = .05   P = .01
F for targets     = 50.0/3.42 = 14.62             2.80      4.22
F for machines    = 20.0/3.42 =  5.85             3.19      5.08
F for interaction = 3.33/3.42 =  0.97             2.30      3.20


The F ratios are given below the table, together with the F's required for
significance at the .05 and .01 points as determined from Snedecor's table
(Table F). From these results it appears that variations in target size
definitely carry with them systematic variations in test score. There is a
law of relationship fairly well established between target size and difficulty
of the test. The F ratio for machines is significant beyond the .01 point,
leaving us with considerable confidence that the machine differences, as
such, have a real bearing upon the difficulty of the task.

This conclusion is in some doubt because of possible failure of experi- 
mental design, however. Since the examinees were different groups for 
the three test machines, we cannot be sure that some real differences of 
ability have not combined with minor machine differences to give an appar- 
ently significant machine difference. A matching of examinees for machines 
might have improved the precision of the experiment. This would have 
entailed modification in the analysis-of-variance operations. The F for 
interaction proved to be rather decidedly insignificant. There is no reason 
to believe that changing target size has different effects depending upon the 
machine with which it is associated. 
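
The variance estimates and F ratios of Table 12.6 follow from the sums of squares and degrees of freedom already found. The Python sketch below is only an illustration; the dictionary names are hypothetical, and the variances are rounded to two decimals before dividing, on the assumption that the tabled F ratios were computed from rounded variances.

```python
# Sums of squares found earlier for the psychomotor-test data.
sums_of_squares = {"targets": 150, "machines": 40, "interaction": 20, "within": 164}

r, k, n = 4, 3, 5   # rows (targets), columns (machines), observations per set

# Degrees of freedom for each source.
degrees_of_freedom = {
    "targets": r - 1,
    "machines": k - 1,
    "interaction": (r - 1) * (k - 1),
    "within": r * k * (n - 1),
}

# Estimate of population variance from each source, rounded to two decimals.
variance = {
    source: round(sums_of_squares[source] / degrees_of_freedom[source], 2)
    for source in sums_of_squares
}

# Each special source is compared with the within-sets (error) variance.
f_ratio = {
    source: round(variance[source] / variance["within"], 2)
    for source in ("targets", "machines", "interaction")
}

for source, f in f_ratio.items():
    print(source, variance[source], f)
```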

Removal of Sources of Variation. It may illuminate the concepts of 
different kinds of variance and the way in which they contribute to total 
variance in the sample if we separate them in another way. 

Table 12.7A shows the 12 means of sets for the psychomotor-test data.
Variations among them are due to the three possible sources: target differences,
machine differences, and the interaction of the two. The possible
effects of target size are most apparent in the means of the rows: 3, 4, 6, and
7. The possible effects of machine differences are most apparent in the
means of the columns: 6, 5, and 4. The possible interaction variance is
obscured. It possibly contributes both to the means of rows and of columns;
we do not know. Let us strip away first the variations attributable to
machines and then that attributable to targets and see what variations
are left.

The mean of all observations is 5. Any deviation of a column mean
from 5 indicates a constant error for a particular machine. Machine 1
gave a mean of 6, indicating that machine 1 had a constant error of +1.
Machine 2 apparently had no constant error, while machine 3 had a constant
error of -1. If we deduct from each cell or set mean in column 1 the
amount of constant error involved for machine 1, we should presumably
remove from the means in column 1 the influence of machine 1 as a source
of variation. We can do likewise for column 3, deducting the constant
error of -1, which is equivalent to adding +1 to each mean. We need
do nothing for column 2. The results of these operations are shown in
Table 12.7B. The means of the columns are now all 5, to agree with the
composite mean, $M_t$. The means of the rows have been unaffected (they
are still 3, 4, 6, and 7) because the changes in one column are compensated
for by changes in reverse direction in another column. The cell values in
Table 12.7B still have in them the variance attributable to targets and to
interaction variance.

Next we remove the target variance. The constant errors for rows are
-2, -1, 1, and 2, respectively. Deducting these from the values in their
respective rows of Table 12.7B, we have the results in subtable C. The
means of the rows as well as of the columns are now all 5. But within four


TABLE 12.7. ANALYSIS OF THE BETWEEN-SETS SUMS OF SQUARES IN THE PSYCHOMOTOR-
TEST DATA INTO THREE COMPONENTS BY SUCCESSIVE REMOVAL OF CONTRIBUTING
SOURCES OF VARIATION

                 Column
Row          1      2      3        Σ      M

A. Original Matrix of Means of Sets

 A           4      3      2        9      3
 B           5      5      2       12      4
 C           7      6      5       18      6
 D           8      6      7       21      7
 Σ          24     20     16       60
 M           6      5      4               5

B. With Variations Associated with Machines Removed

 A           3      3      3        9      3
 B           4      5      3       12      4
 C           6      6      6       18      6
 D           7      6      8       21      7
 Σ          20     20     20       60
 M           5      5      5               5

C. With Variations Associated with Target Size Also Removed; Only Interaction
Variance Remaining

 A           5      5      5       15      5
 B           5      6      4       15      5
 C           5      5      5       15      5
 D           5      4      6       15      5
 Σ          20     20     20       60
 M           5      5      5               5

cells there are departures from 5. These are possibly the interaction deviations,
depending upon whether or not they prove to be significant. Machine
2 would seem to favor high scores when coupled with target B and to favor
low scores when coupled with target D. Machine 3 has a reverse tendency.
But the F showed these deviations to be insignificant. There seem to be
no good, logical reasons to expect any systematic coupling of target and
machine. In other problems there may be significant interaction effects.
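
The successive removal of machine and target effects can be sketched in Python. This is an illustrative reconstruction of the steps described above, starting from the twelve set means; the names `col_error`, `row_error`, `step_b`, and `step_c` are hypothetical.

```python
# Set means: rows are targets A-D, columns are machines 1-3.
means = {
    "A": [4, 3, 2],
    "B": [5, 5, 2],
    "C": [7, 6, 5],
    "D": [8, 6, 7],
}
n = 5  # observations per set

targets = list(means)
n_cols = 3
grand_mean = sum(sum(row) for row in means.values()) / (len(targets) * n_cols)

# Constant error of each machine: its column mean minus the grand mean.
col_error = [
    sum(means[t][j] for t in targets) / len(targets) - grand_mean
    for j in range(n_cols)
]

# Subtable B: machine effects removed from every set mean.
step_b = {t: [means[t][j] - col_error[j] for j in range(n_cols)] for t in targets}

# Constant error of each target: its row mean minus the grand mean.
row_error = {t: sum(means[t]) / n_cols - grand_mean for t in targets}

# Subtable C: target effects also removed; only interaction deviations remain.
step_c = {t: [value - row_error[t] for value in step_b[t]] for t in targets}

# The remaining departures from the grand mean give the interaction sum of squares.
ss_interaction = n * sum(
    (value - grand_mean) ** 2 for row in step_c.values() for value in row
)

for t in targets:
    print(t, step_b[t], step_c[t])
print(ss_interaction)
```

The four cells that still depart from 5 in the last step are the same cells that contribute to the interaction sum of squares of 20.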

A Modified Error Term. The finding of insignificant deviations among 
the residual means suggests several things. One is that these variations 
are random-sampling effects that really belong to the within variance but 
were not pulled out with it. There are good reasons, therefore, for com- 
bining this source of variance with that from within sets. The sum of 
squares for this was 20. Combined with that from within sets, this gives a 
total of 184. We also combine degrees of freedom. With 54 df, we have
a trivial change from 3.42 to 3.41, which makes no difference in the F ratios. 
In other situations the changes might be much greater. Such a modified 
error term should be used when the F for interaction is not significant. 

Solution from Original Measurements. Next will be given the formulas
and their applications for the solution of sums of squares without computing 
means and deviations. With small integral numbers to start with, or 
numbers coded to such magnitude, these procedures are often more con- 
venient than those utilizing deviations. The first solution, with deviations, 
is more meaningful to the beginner. In the following exposition, each 
formula will be stated and then immediately applied to the psychomotor-test 
data. 

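Total sum of squares:

$$\sum x_t^2 = \sum X_{rk}^2 - \frac{(\sum X)^2}{N}$$ (12.18)

$$\begin{aligned}
\sum x_t^2 &= 6^2 + 4^2 + 4^2 && \text{(from first row of Table 12.4)} \\
&\quad + \cdots \\
&\quad + 9^2 + 4^2 + 8^2 && \text{(from last row in Table 12.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1{,}874 - 1{,}500 = 374
\end{aligned}$$

Sum of squares between sets:

$$\sum d_{rk}^2 = \frac{\sum(\sum X)_{rk}^2}{n} - \frac{(\sum X)^2}{N}$$ (12.19)

$$\begin{aligned}
\sum d_{rk}^2 &= \tfrac{1}{5}\,[20^2 + 15^2 + 10^2 && \text{(from first Σ row of Table 12.4)} \\
&\quad + \cdots \\
&\quad + 40^2 + 30^2 + 35^2] && \text{(from last Σ row of Table 12.4)} \\
&\quad - \frac{(300)^2}{60} \\
&= 1{,}710 - 1{,}500 \\
&= 210 && \text{(sum of squares between sets)}
\end{aligned}$$

Sum of squares between rows:

$$\sum d_r^2 = \frac{\sum(\sum X)_r^2}{nk} - \frac{(\sum X)^2}{N}$$ (12.20)

$$\begin{aligned}
\sum d_r^2 &= \tfrac{1}{15}(45^2 + 60^2 + 90^2 + 105^2) - 1{,}500 \\
&= 1{,}650 - 1{,}500 \\
&= 150 && \text{(sum of squares between rows)}
\end{aligned}$$

Sum of squares between columns:

$$\sum d_k^2 = \frac{\sum(\sum X)_k^2}{nr} - \frac{(\sum X)^2}{N}$$ (12.21)

$$\begin{aligned}
\sum d_k^2 &= \tfrac{1}{20}(120^2 + 100^2 + 80^2) - 1{,}500 \\
&= 1{,}540 - 1{,}500 \\
&= 40 && \text{(sum of squares between columns)}
\end{aligned}$$

Sum of squares for interaction:

$$\sum d_{r \times k}^2 = \sum d_{rk}^2 - \sum d_r^2 - \sum d_k^2$$ (12.22)

$$\sum d_{r \times k}^2 = 210 - 150 - 40 = 20 \quad \text{(sum of squares for interaction)}$$

Sum of squares within sets:

$$\sum x_w^2 = \sum x_t^2 - \sum d_{rk}^2$$ (12.23)

$$\sum x_w^2 = 374 - 210 = 164 \quad \text{(sum of squares within sets)}$$

It will be noted that the correction factor $(\sum X)^2/N$, which appears in
most of these equations, is identical and once computed will do thereafter.

The sums of squares by this method are seen to be identical with those
found by the preceding method. The estimation of the population variance
from each source and the application of the F test are the same as before
(see Table 12.6).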
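
The raw-score solution can be sketched in Python from the cell sums alone. This is a minimal illustration: the cell sums are the Σ rows of Table 12.4, the figure 1,874 for the sum of squared scores is taken from the text, and the variable names are hypothetical.

```python
# Cell sums from the sum rows of Table 12.4 (targets A-D by machines 1-3).
cell_sums = {
    "A": [20, 15, 10],
    "B": [25, 25, 10],
    "C": [35, 30, 25],
    "D": [40, 30, 35],
}
sum_x2 = 1874            # sum of all 60 squared scores, as given in the text
n, r, k = 5, 4, 3        # observations per set, rows, columns
N = n * r * k

grand_sum = sum(sum(row) for row in cell_sums.values())
correction = grand_sum ** 2 / N          # computed once, reused in every formula

row_sums = [sum(row) for row in cell_sums.values()]
col_sums = [sum(row[j] for row in cell_sums.values()) for j in range(k)]

ss_total = sum_x2 - correction                                          # (12.18)
ss_sets = (
    sum(s ** 2 for row in cell_sums.values() for s in row) / n - correction
)                                                                       # (12.19)
ss_rows = sum(s ** 2 for s in row_sums) / (n * k) - correction          # (12.20)
ss_cols = sum(s ** 2 for s in col_sums) / (n * r) - correction          # (12.21)
ss_interaction = ss_sets - ss_rows - ss_cols                            # (12.22)
ss_within = ss_total - ss_sets                                          # (12.23)

print(ss_total, ss_sets, ss_rows, ss_cols, ss_interaction, ss_within)
```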

A Two-way Classification Analysis without Replications. Occasionally 
there arises the kind of research problem in which there are two experi- 
mental variations but only one observation for each combination of condi- 
tions. This kind of problem will be illustrated by the use of ratings. The 
data in Table 12.8 will be utilized. 

In these data, three raters have given their ratings of each of seven indi- 
viduals in a single trait. The procedure of analysis is much like that previ- 
ously illustrated when there are replications. The main difference is that 
the interaction and error effects are not segregated here, since there is no 
basis for doing so. The error term is derived from this combined source. 

The total sum of squares is computed the same way as by formula (12.18), 
which need not be repeated here. Applied to the data of Table 12.8, 


Хх, = 720.00 — 618.86 = 101.14 
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TABLE 12.8. APPLICATION OF ANALYSIS OF VARIANCE IN A TWO-WAY CLASSIFICATION 
WITHOUT REPLICATION У 


(ХХ)? = 4,394 УХ; = 720 


The sum of squares between rows is given by the formula

$$\Sigma d_r^2 = \frac{\Sigma(\Sigma X_r)^2}{k} - \frac{(\Sigma X)^2}{N} \tag{12.24}$$

where k = number of columns, r = number of rows, and other symbols are
as defined in preceding formulas. Applied to the data of Table 12.8, we
have

$$\Sigma d_r^2 = \frac{2{,}010}{3} - \frac{12{,}996}{21} = 670.00 - 618.86 = 51.14$$


The sum of squares between columns is given by

$$\Sigma d_c^2 = \frac{\Sigma(\Sigma X_c)^2}{r} - \frac{(\Sigma X)^2}{N} \tag{12.25}$$

where the symbols are as defined above. Applied to the data of Table 12.8,
this gives

$$\Sigma d_c^2 = \frac{4{,}394}{7} - 618.86 = 8.85$$


The sum of squares for the remainder is obtained by deducting the last
two sums of squares from the total sum of squares. We therefore have for
the remainder sum of squares,

$$101.14 - 51.14 - 8.85 = 41.15$$

We are now ready to estimate variances and compute F ratios. The work
is summarized in Table 12.9. Both F ratios prove to be insignificant. We 
therefore do not reject the hypothesis that there are no differences between 
raters and between ratees. There may be such real differences, but our F 
tests fail to show them. We should not be very surprised to find no sig- 
nificant differences among raters, except as some of them show marked 
errors of leniency in rating ratees and some do not. We should be surprised, 
however, not to find significant differences among ratees, for individual 
differences in most traits are the almost universal finding. With a larger 
sample, the statistical test might have been sensitive enough to yield a 
significant F for ratees. 


TABLE 12.9. ESTIMATED VARIANCES AND F RATIOS FROM THE DATA OF TABLE 12.8

Source              Sum of squares    df    Estimated variance     F       P
Ratees (rows)           51.14          6          8.52           2.48    >.05
Raters (columns)         8.85          2          4.43           1.29    >.05
Remainder               41.15         12          3.43
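The whole analysis without replication can be retraced from the totals that survive for Table 12.8. The sketch below assumes the sum of squared row totals is 2,010 (inferred from 670.00 × 3) and that ΣX = 114 (the square root of 12,996); the names are illustrative. Because it carries full precision, its remainder and F values differ in the second decimal from figures obtained by subtracting rounded sums.

```python
# Two-way classification without replication: ratees (rows) by raters (columns).
r, k = 7, 3                  # number of rows (ratees) and columns (raters)
n_total = r * k              # one observation per cell
sum_x = 114                  # since (sum X)^2 = 12,996
sum_x_sq = 720               # sum of squared ratings
sum_row_totals_sq = 2010     # assumed: inferred from 670.00 * 3
sum_col_totals_sq = 4394     # from the surviving table totals

correction = sum_x ** 2 / n_total
ss_total = sum_x_sq - correction
ss_rows = sum_row_totals_sq / k - correction
ss_cols = sum_col_totals_sq / r - correction
ss_remainder = ss_total - ss_rows - ss_cols

# Degrees of freedom and variance estimates.
df_rows, df_cols = r - 1, k - 1
df_remainder = df_rows * df_cols
var_rows = ss_rows / df_rows
var_cols = ss_cols / df_cols
var_remainder = ss_remainder / df_remainder

# The remainder variance serves as the error term for both F ratios.
f_rows = var_rows / var_remainder
f_cols = var_cols / var_remainder

print(round(ss_rows, 2), round(ss_cols, 2), round(ss_remainder, 2))
print(round(f_rows, 2), round(f_cols, 2))
```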


The smallness of sample, however, is not the whole story behind the 
insignificant F’s. Note that it was stated at the beginning of this section 
that the error term includes contributions from interaction. If the inter- 
action effects are of sufficient importance, they inflate the variance computed 
from the residual sum of squares and thus reduce the size of both F ratios. 
We know that there are often halo errors, which can be defined statistically 
as interaction effects—between rater and ratee. We should not be able to 
segregate this interaction effect without having independent replications, 
which would be difficult to obtain, or without having ratings made by the 
same raters of the same ratees on other traits.¹

Another reason for the small variance among ratees is the lack of agree- 
ment among the raters. Some of this can be attributed to halo errors. 
In the extreme case, if there were zero correlations among the raters’ ratings, 
the means of the ratees would tend toward equality, or no variance at all. 
The higher the intercorrelation of raters, the greater will be the variance 
estimated from between ratees. To this problem of inter-rater correlation 
we turn next. 

¹ For further treatment of ratings by analysis of variance, see Guilford, J. P. Psycho-
metric Methods. 2d ed. New York: McGraw-Hill, 1954. Pp. 281-288.

Intraclass Correlation. From the data of Table 12.8 we can use the
information already extracted about variances to compute correlations
between raters. The average intercorrelation thus obtained is known as
an intraclass correlation. This correlation is given by the formula

$$r_{cc} = \frac{V_r - V_e}{V_r + (k - 1)V_e} \quad \text{(Intraclass correlation among k series)} \tag{12.26}$$

where $V_r$ = variance between rows, where each row stands for a person
$V_e$ = variance for residuals (or error)
k = number of columns

For the data of Table 12.8,

$$r_{cc} = \frac{8.52 - 3.43}{8.52 + 2(3.43)} = .33$$


This result indicates that the average of the intercorrelations of the three 
sets of ratings is .33. If we take the intercorrelations of raters to be an 
indication of reliability of ratings, we can say that the typical reliability
of a single rater's ratings is of the order of .33. The actual correlations
between single pairs might vary considerably from this figure because of
sampling errors in such a small sample.

If we want to know the reliability of a sum or mean of these three raters'
ratings in this population, a modified formula is available:

$$r_{kk} = \frac{V_r - V_e}{V_r} \quad \text{(Intraclass correlation of a sum or average)} \tag{12.27}$$

in which the symbols are as defined before. Applied to the same data,

$$r_{kk} = \frac{8.52 - 3.43}{8.52} = .60$$


From this we infer that if we averaged the three ratings for each ratee 
and could correlate the set of averages with a similar set of averages, the 
result would be about .60. Averaging reduces the relative importance of 
errors, leaving the relationships enhanced. This principle of reliability 
will be treated at some length in the chapter on reliability of measurements 
(Chap. 17). 
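Formulas (12.26) and (12.27) are easy to verify numerically. The sketch below uses the two variance estimates quoted in the text (8.52 and 3.43); the function names are illustrative. As a cross-check that is not part of the original discussion, it also applies the Spearman-Brown formula, which steps a single-rater reliability up to that of k raters and agrees algebraically with (12.27).

```python
def intraclass_single(v_rows, v_error, k):
    """Formula (12.26): average intercorrelation among k series."""
    return (v_rows - v_error) / (v_rows + (k - 1) * v_error)


def intraclass_mean(v_rows, v_error):
    """Formula (12.27): reliability of the sum or mean of the k series."""
    return (v_rows - v_error) / v_rows


def spearman_brown(r_single, k):
    """Reliability of a k-fold composite from a single-unit reliability."""
    return k * r_single / (1 + (k - 1) * r_single)


v_rows, v_error, k = 8.52, 3.43, 3   # values from the worked example
r_cc = intraclass_single(v_rows, v_error, k)
r_kk = intraclass_mean(v_rows, v_error)

print(round(r_cc, 2), round(r_kk, 2))
print(round(spearman_brown(r_cc, k), 2))
```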


GENERAL COMMENTS ON ANALYSIS OF VARIANCE 


Assumptions to Be Satisfied in Applying Analysis of Variance. Like 
most statistics, those involved in analysis of variance have been derived 
on the basis of mathematical reasoning. That reasoning starts with postu- 
lates or assumptions. If those assumptions are satisfied within certain 
limits of tolerance, the results in terms of F ratios may be interpreted as 
described in this chapter. If those assumptions are not sufficiently approxi- 
mated, there is considerable risk that the conclusions may be faulty. 

Although assumptions have been mentioned from time to time, the four
assumptions generally to be met are repeated here for emphasis:

1. The contributions to variance in the total sample must be additive.
The summative idea is illustrated in Table 12.7, in which we stripped off 
one by one the three sources of variance. The additive nature of squared 
variations is dependent to some extent upon other assumptions to follow. 

2. The observations within sets must be mutually independent. The
“laws of chance” must be allowed to operate in an unrestricted manner.
The occurrence of a certain deviation in one observation must be in no way
dependent upon any other deviation. This is, of course, a property of
random sampling. The random sampling occurs within sets. The inten- 
tional variations of experimental conditions may produce systematic varia- 
tions between sets. Whether or not such systematic variations do occur 
is the thing to be tested. 

3. The variances within experimentally homogeneous sets must be approxi- 
mately equal. By "experimentally homogeneous" is meant observations 
under one special set of experimental conditions. The “within-sets” 
variance is commonly the denominator of the F ratio. It therefore carries 
a heavy burden, especially if there is more than one F to be computed
from the same data. This variance is used as a single estimate of the
population variance, and all contributors to it should tell a similar story. If there
are doubts about the homogeneity of variances in the sets, Bartlett’s test 
should be applied. 

4. The variations within experimentally homogeneous sets should be 
from normally distributed populations. 

If we follow the practice of free and random sampling within sets and if 
we use a good metric scale, we can ordinarily feel assurance that the F test 
will not be invalidated. It must be remembered, however, that the con- 
ditions of sampling are never ideal. F tests are therefore only approximate. 
Under somewhat doubtful circumstances, an F that proves to be significant 
at the .05 level may be actually significant anywhere from the .04 to the 
.07 level; one significant at the .01 level may be actually significant between
the .005 and .02 levels; and so on. If anything, the significance is likely to 
be lower than that indicated by the result, when assumptions are not well 
satisfied. 

General Uses and Limitations of Analysis of Variance. There is insuffi- 
cient space here to do more than give this introduction to the analysis-of- 
variance methods. There are many and varied applications of these basic 
cases—the separation of sums of squares among a few sets of data into the 
“within” and “between” components—generally in the social sciences. 

¹ Cochran, W. G. Some consequences when the assumptions for the analysis of variance
are not satisfied. Biometrics, 1947, 3, 22-38.

Conditions affecting sets of measurements often vary in a number of ways
in the same experiment. This complicates the analysis-of-variance solution
in various ways. We have problems of three-way classification, four-way
classification, and so on. We have triple and quadruple interactions.
There are problems in which the sets of data are not independent, involving
correlated means. There is a technique for analysis of covariance.
Covariance and correlation are closely related, as will be seen in some of the
later chapters. For further descriptions of how to adapt analysis of variance to
various kinds of experimental problems, the reader is referred to books that
treat the subject at much greater length.¹

Not the least of the merits of analysis of variance is the rather strict set
of requirements it imposes in the designing of experiments. Experimental
designs have been observed, particularly in psychophysics, for a long time.
But they have not been generally so consciously considered or so well planned
as when the experimenter knows that analysis of variance is to be used.
Discussions of experimental designs will be found in extensive treatments
elsewhere.²

Exercises

The values in Data 12A represent measurements of the lower threshold for hearing the
pitch of tones. The observer was the same throughout. Each trial was composed of four
observations. Four trials were given on two different days.

1. Using the four sets of observations made on the first day, apply an F test to determine
whether there were systematic changes in threshold level from trial to trial. Estimate
variances by using deviations from means. Interpret your results statistically and
psychologically.


DATA 12A. DATA IN A TWO-WAY CLASSIFICATION


¹ Edwards, A. L. Experimental Design in Psychological Research. New York: Rinehart,
1950; Johnson, P. O. Statistical Methods in Research. New York: Prentice-Hall, 1949;
Lindquist, E. F. Statistical Analysis in Educational Research. Boston: Houghton Mifflin,
1940.

² Baxter, B. Problems in the planning of psychological experiments. Amer. J. Psychol.,
1941, 54, 270-280; Kogan, L. S. Variance designs in psychological research. Psychol.
Bull., 1953, 50, 1-40; Cochran, W. G., and Cox, G. M. Experimental Designs. New York:
Wiley, 1950.




2. Make a similar F test of the data derived from the second day's observations, using
the formulas for original measurements. Make any t tests that seem called for.

3. Treat the entire table of data as a two-way classification problem. Make F tests to
determine the significance of the three special sources of variance (between trials, between
days, and interaction of trials and days). Interpret your results.

4. Take out each source of variance in Data 12A step by step, as was demonstrated in
Table 12.7.

5. Compute an F ratio for the analysis of Data 12B. 


DATA 12B. RATINGS OF SEVEN INDIVIDUALS BY THREE RATERS IN A
PARTICULAR TRAIT

[Table body (ratees by raters) not recoverable from the source.]


6. Compute an intraclass correlation between raters and between averages of ratings in
Data 12B.

Answers

1. $n\Sigma d^2 = 62.76$; $\Sigma x_w^2 = 125.0$; F = 2.01 (df = 3, 12).

2. $n\Sigma d^2 = 34.75$; $\Sigma x_w^2 = 31.00$; F = 4.49 (df = 3, 12); t = 2.19 and t = 1.64 for two
differences between trial means (subscripts not legible in the source); (at 12 df a
difference of 3.97 is significant at the .05 level); $\sigma_{d_M}$ (from the within variance) = 1.824.

3. $\Sigma x_t^2 = 325.5$; $\Sigma d_s^2 = 169.5$; $\Sigma d_r^2 = 72.0$; $\Sigma d_c^2 = 90.5$; $\Sigma d_{rc}^2 = 7.0$; $\Sigma x_w^2 = 156.0$;
F (between rows) = 11.08 (df = 1, 24); F (between columns) = 4.64 (df = 3, 24);
F (interaction) = 0.36 (df = 3, 24).

4. Means of columns and rows constitute the necessary check.

5. $\Sigma d_r^2 = 61.14$; $\Sigma d_c^2 = 3.71$; $\Sigma x_t^2 = 79.14$; remainder sum of squares = 14.29;
F (rows) = 8.56 (df = 6, 12); F (columns) = 1.56 (df = 2, 12).

6. $r_{cc} = .67$; $r_{kk} = .86$.


CHAPTER 13 


SPECIAL CORRELATION METHODS AND PROBLEMS 


Pearson's product-moment coefficient is the standard index of the amount 
of correlation between two things, and we employ it whenever it is possible 
and convenient to do so. But there are data to which this kind of correlation 
method cannot be applied, and there are instances in which it can be applied 
but in which, for practical purposes, other procedures are more expedient. 
The Pearson coefficient cannot or should not be computed, for example, 
unless the two variables X and Y are measured on continuous metric scales 
and unless the regressions are linear (see Chap. 15). Many of our data 
are in terms of frequencies of cases having attributes; they are on variables 
of a “qualitative” rather than a quantitative sort. Less often, two con- 
tinuously measured variables bear to one another a relationship that is 
curved rather than in the form of a straight line. In this chapter will be 
described some procedures that take care of these irregular situations and 
of other situations where short-cut methods are better used to estimate 
a Pearson r. 

Even when we can apply the product-moment correlation method, how- 
ever, there are many circumstances which may give rise to a somewhat 
different estimate of correlation than is typical or to one that does not 
apply to the population in which we are interested. Samples may be 
heterogeneous or they may be restricted in variability or they may be forced 
into a smaller number of categories than we need for good estimates of 
correlation, estimates free from errors of grouping. These, and other com- 
mon irregularities in the sampling situation or in the data, call for special 
corrective steps and for special interpretive action. It is impossible to 
anticipate all the peculiarities of data that the reader may encounter, but 
the more common exceptions to ideal correlation conditions will be touched
upon.


SPEARMAN'S RANK-DIFFERENCE CORRELATION METHOD


When samples are small, a common procedure applied to regular data in
place of the product-moment method is the rank-difference method of
Spearman. It is conveniently applied as a quick substitute when the number
of pairs, or N, is less than 30. It is even more conveniently applied
when the data are already in terms of rank orders rather than in terms of
measurements.

The Computation of a Spearman Rho. If we have data in terms of
measurements or scores, it is first necessary to translate them into rank
order. The procedure will be demonstrated by means of the data in Table
13.1. There we have 15 pairs of scores for 15 individuals who responded
to sets of cartoons and limericks by judging their humor values, each on a
point scale. The score in each case is the sum of the points each individual
assigned to the set. We could correlate these scores in the usual manner,
described in Chap. 8. The rank-difference method will be found shorter.


TABLE 13.1. A RANK-DIFFERENCE CORRELATION BETWEEN HUMOR SCORES IN
REACTIONS TO CARTOONS AND TO LIMERICKS

Cartoon score     R1      R2      D       D²
     47          11        8     3       9.00
     71           4        6     2       4.00
     52           9        5     4      16.00
     48          10       14     4      16.00
     35          14.5     15     0.5     0.25
     35          14.5     12     2.5     6.25
     41          12.5      8     4.5    20.25
     82           1        3     2       4.00
     72           3        1     2       4.00
     56           7        4     3       9.00
     59           6       10     4      16.00
     73           2        2     0       0.00
     60           5       13     8      64.00
     55           8        8     0       0.00
     41          12.5     11     1.5     2.25
                                 ΣD² = 171.00

[Limerick scores not recoverable from the source.]


The following steps are necessary:


Step 1. Rank the individuals from the highest to the lowest in the first 
variable (here it is “cartoon score”), and call these ranks $R_1$. The
highest score receives the rank of 1 (which is arbitrary; we might 
have called it 15), the next highest 2, etc. The only difficulty 
encountered is when we find ties. For example, in Table 13.1, 
two individuals have scores of 41. One of them comes at rank 12 
and the other at rank 13. We do not know which, if either, is better, 
yet we must fill these two rank positions; therefore we take the 
average of the tied ranks and call them both 12.5. We make certain
that the next ranking scorer is called 14, unless he also is tied.
We find that he is tied with another who has a score of 35. We
treat these two in a similar manner, and so they become each 14.5.
If the lowest person is not tied with others, the last rank should be
equal to N (in this case, 15). This serves as a check as to accuracy
of ranking, though, of course, it will not detect inversions in rank 
order somewhere along the line. It merely shows whether any rank 
has been repeated, whether any individuals have been overlooked, 
or whether ties have somewhere not been properly treated. 

Step 2. Rank the second list of measurements in a similar manner, and
call them $R_2$. In this problem, there are three scores of 75 for the
individuals occupying places 7, 8, and 9. We call them all 8, leaving 
out of the list 7 and 9. This treats the three alike, as they should 
be, yet gives us a full set of 15 ranks. 

Step 3. For every pair of ranks (for each individual), determine the differ- 
ence in ranks. The smaller one can be subtracted from the larger 
one in each case, with no attention being paid to algebraic signs, 
for they are all going to be squared anyway. 

Step 4. Square each difference to find $D^2$.

Step 5. Sum the squares of the differences (see the last column of Table 13.1)
to find $\Sigma D^2$. The sum in our illustrative problem is 171.00.

Step 6. Compute the coefficient ρ (Greek letter rho) by means of the formula

$$\rho = 1 - \frac{6\Sigma D^2}{N(N^2 - 1)} \quad \text{(Rank-difference coefficient of correlation)} \tag{13.1}$$

where $\Sigma D^2$ = sum of the squared differences between ranks and N = number
of pairs of measurements.

In this problem

$$\rho = 1 - \frac{6 \times 171}{15(15^2 - 1)} = 1 - \frac{1{,}026}{3{,}360} = .6946$$


By this procedure, then, the estimate of the amount of correlation between 
the two sets of scores is .69. How shall we interpret this correlation, as 
compared with a Pearson r? 
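The six steps can be sketched in a few lines. The cartoon scores and the limerick ranks are taken from Table 13.1; the helper `average_ranks` is an illustrative name, and it applies the tie rule of Step 1 (tied scores share the mean of the rank positions they occupy).

```python
cartoon = [47, 71, 52, 48, 35, 35, 41, 82, 72, 56, 59, 73, 60, 55, 41]
limerick_ranks = [8, 6, 5, 14, 15, 12, 8, 3, 1, 4, 10, 2, 13, 8, 11]  # R2 column


def average_ranks(scores):
    """Rank from highest (1) to lowest; tied scores get the mean of their positions."""
    ordered = sorted(scores, reverse=True)
    rank_of = {}
    for s in set(scores):
        first_position = ordered.index(s) + 1
        count = ordered.count(s)
        rank_of[s] = first_position + (count - 1) / 2
    return [rank_of[s] for s in scores]


r1 = average_ranks(cartoon)                                   # Step 1
diffs_sq = [(a - b) ** 2 for a, b in zip(r1, limerick_ranks)]  # Steps 3 and 4
sum_d2 = sum(diffs_sq)                                        # Step 5
n = len(cartoon)
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))                     # Step 6, formula (13.1)

print(sum_d2, round(rho, 4))
```

The check described in Step 1 also holds here: the largest rank assigned is 14.5 rather than 15 only because the two lowest scorers are tied.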

Interpretation of a Rho Coefficient. The rank-difference coefficient is
practically equivalent to the Pearson r, numerically. There is a conversion
formula by which the corresponding Pearson r can be estimated from rho.
But this formula assumes large samples, which is precisely what we do not
have when we compute rho. Results from the formula show, however, that
on the average r is slightly greater than ρ and that the maximum difference,
by the formula, is approximately .02, when they are both near .50. We
may therefore treat an obtained rho as an approximation to r.

Significance of a Rho Coefficient. There is no generally accepted formula
for estimating the standard error of rho. We cannot, therefore, determine
confidence limits. We can test the hypothesis that the population
correlation is zero, in two ways. If N is as great as 25, the standard error of a zero
rank-order correlation coefficient is given by the formula

$$\sigma_\rho = \frac{1}{\sqrt{N - 1}} \quad \text{(Standard error of rho when the population value is zero)} \tag{13.2}$$

Under these conditions the sampling distribution may be assumed to be
normal, and we may estimate a z ratio by the formula

$$z = \rho\sqrt{N - 1} \tag{13.3}$$

When N is less than 25, the interpretation is best made by the aid of
Table L, in which are given rho coefficients significant at the .05 and .01 
levels of confidence. The rho of .69 obtained in the illustrative problem 
where N = 15 would be regarded as significant beyond the .01 level. It is 
thus highly unlikely that there is no correlation between the “cartoon” 
and “limerick” scores, but how close to .69 the population correlation is we 
cannot say. 
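For samples of 25 or more, formulas (13.2) and (13.3) amount to one line each. The sketch below uses a hypothetical rho of .40 with N = 26, chosen only because the square root comes out even; neither value appears in the text, and the function names are illustrative.

```python
import math


def rho_standard_error(n):
    """Formula (13.2): standard error of rho when the population value is zero."""
    return 1 / math.sqrt(n - 1)


def rho_z_ratio(rho, n):
    """Formula (13.3): rho divided by its standard error under the null hypothesis."""
    return rho * math.sqrt(n - 1)


# Hypothetical values, for illustration only.
example_rho, example_n = 0.40, 26
se = rho_standard_error(example_n)
z = rho_z_ratio(example_rho, example_n)

print(round(se, 3), round(z, 2))
```

A z of this size would be referred to the normal curve; with N below 25, as in the 15-case problem above, the table of significant rho values is the proper reference instead.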

A Brief Evaluation of the Rank-difference Correlation. Although there 
is no good estimate of the standard error of a rho coefficient, there is reason 
to believe that rho is almost as reliable as a Pearson r of the same size in a 
sample of the same size. Consequently, rho is almost as good an estimation 
of correlation as the Pearson r. If rho is used as a convenient estimate of 
r, the usual assumptions of linear regression and homoscedasticity (which 
would apply to good measurements of X and Y, not necessarily to those
obtained or to the ranks) should be tenable. 

In view of the fact that rho will ordinarily be computed only in small 
samples, in which low correlations cannot be accurately determined, the 
chief use of rho, under these circumstances, would be to test the hypothesis 
of zero correlation. When correlations are high, we may have almost as 
much confidence in rho for indicating the amount of correlation as we have 
in r applied to samples of the same size. 

Kendall has developed a ranking-method correlation coefficient called τ
(tau), which rests on no particular assumptions.¹ It has numerous
applications, including the testing of hypotheses, but bears no direct relation to the
traditional family of product-moment correlations.

¹ Kendall, M. G. Rank Correlation Methods. London: Griffin, 1948.


THE CORRELATION RATIO


The correlation ratio is a very general index of correlation particularly
adapted to data in which a curved regression prevails. Among test scores,
linear relationships are apparently the almost universal type of regression.
Normality, or near normality, in both distributions correlated is almost 
sufficient in itself to promote linearity. Outside the sphere of psychological 
and educational tests, however, or when nontest variables are correlated 
with test scores, we sometimes encounter curved trends in the scatter dia- 
gram. The means of the columns do not progressively increase as we go 
up the X scale. They may increase slowly at first, then rapidly later; or 
they may increase to a maximum in the center and then decrease; or other 
systematic divergencies from linearity may be apparent. 

Nonlinear Regressions. A common instance of nonlinear relationship 
is found when we correlate performance scores with chronological age. 
Typically, goodness of performance, as measured, increases most rapidly 
from ages five to ten and thereafter shows a slackening in upward trend 
through the teens. If we follow the progression still further, we find typi- 
cally a maximal performance somewhere in the twenties, with slow decline 
to the forties and an increasing rate of decline thereafter. If we included 
all ages from five to seventy-five in our correlation study and if we com- 
puted the usual Pearson r between age and scores, the r would probably 
prove to be near zero. On such a correlation diagram, the scattering of 
points would be considerably dispersed from any straight line that we 
might try to draw through the data, slanting upward or slanting downward. 
Inspection would show, nevertheless, a law of relationship between age 
and performance but a relationship that takes into account the waxing 
and waning of ability both within the span of ages studied. 
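A small artificial example shows how a strong but curved relationship can leave the Pearson r at zero. The data below are hypothetical (a symmetric rise and fall with a little scatter), not the age and performance data discussed in the text. The second function computes the ratio of the standard deviation of the column means to the total standard deviation, which is the correlation ratio taken up later in this chapter.

```python
from math import sqrt
from statistics import mean, pstdev

# Hypothetical inverted-U data: two observations at each of seven X values.
xs, ys = [], []
for x in range(-3, 4):
    for scatter in (-1, 1):
        xs.append(x)
        ys.append(9 - x * x + scatter)


def pearson_r(xs, ys):
    """Product-moment correlation from deviations about the means."""
    mx, my = mean(xs), mean(ys)
    cross = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    ss_x = sum((a - mx) ** 2 for a in xs)
    ss_y = sum((b - my) ** 2 for b in ys)
    return cross / sqrt(ss_x * ss_y)


def correlation_ratio(xs, ys):
    """SD of the column means (one per case) divided by the SD of all Y values."""
    columns = {}
    for a, b in zip(xs, ys):
        columns.setdefault(a, []).append(b)
    predicted = [mean(columns[a]) for a in xs]
    return pstdev(predicted) / pstdev(ys)


r = pearson_r(xs, ys)
eta = correlation_ratio(xs, ys)
print(round(r, 2), round(eta, 2))
```

Here the straight-line index reports no relationship at all, while the index based on column means is close to 1.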

We might break the chart in two and treat by themselves the years 
during which there is improvement and by themselves the years during 
which there is decline. We should be able to compute a positive correlation 
for the earlier span and a negative correlation for the later span by assuming 
straight-line trends. But these would be of doubtful significance and
certainly would not do justice to the full strength of relationships, even within
the two segments of life span. The reason is that the trends still deviate 
from straight lines. Curvature has been overlooked, and to that extent 
the index of correlation is perhaps markedly underestimated. 

Two Regression Lines and Two Correlation Ratios. The scatter dia- 
gram in Fig. 13.1 represents a sample of relationship between perform- 
ance score in a form-board test and chronological age between five and 
fifteen years inclusive. Here the score is time required for completion; 
hence a high number indicates a poor performance, and the trend is down- 
ward. But the relationship obviously drops most rapidly during the first 
3 years and settles down to slight changes from year to year during the last 
3 years. Two regression lines are drawn in the diagram to show more 
clearly the trends. The regression of test score on age is shown by the 
solid line that is drawn connecting the circlets, which are plotted at the
means of the columns. The regression of age upon test score is shown by
the dotted line, and the means of the rows, by the x's.

Just as we find two regression lines (for an imperfect correlation) in Chap. 
15, where linear regressions are involved, so here we find two regression 
curves, differing in shape as well as in slope. We have accordingly two cor- 
relation ratios, or eta coefficients, one for each of the regressions, and they 
will not necessarily be the same in value. This result differs from that in
the case of linear correlation, where $r_{yx} = r_{xy}$.


[Scatter diagram omitted. X: Chronological Age in Years; Y: form-board time score, in class intervals from 5-9 to 60-64; N = 150.]

FIG. 13.1. A scatter diagram for a correlation-ratio problem.


The two correlation ratios are given by the formulas

$$\eta_{yx} = \frac{\sigma_{y'}}{\sigma_y} \quad \text{(Correlation ratio for the regression of Y on X)} \tag{13.4a}$$

and

$$\eta_{xy} = \frac{\sigma_{x'}}{\sigma_x} \quad \text{(Same, for regression of X on Y)} \tag{13.4b}$$

where $\sigma_{y'}$ = standard deviation of the values (Y') predicted from X
$\sigma_{x'}$ = standard deviation of the X values predicted from Y
$\sigma_y$ and $\sigma_x$ = standard deviations of the two total distributions

The manner in which $\sigma_{y'}$ and $\sigma_{x'}$ are determined will be explained next.

The Computation of a Correlation Ratio. In a prediction problem of
this sort, the best prediction of Y for any column is the mean of the Y's in
that column. This prediction will have the smallest sum of squared
deviations from the observed Y's in that column. So Y' for each column is the
mean of that column. We therefore first compute the means of the columns.
These are listed in column 3 in Table 13.2. Now if there were no
correlation, no law of relationship between Y and X, these Y' values would lie
along the level of the mean of all the Y values, which in this problem is 23.0.
No predictions could then be made on the basis of knowledge of X values.
For every column with its X value (midpoint), the most probable
corresponding Y would be 23.0 and our margin of error would be indicated by $\sigma_y$. It
would be as large as if we had no knowledge of X for each individual (see
Chap. 15 for a more complete discussion of this point).

The more the means of the columns deviate from the mean of all the
Y's, the more accurate our predictions are. We are therefore interested in
how far the Y' values do deviate from 23.0 in this problem. Those
discrepancies ($Y' - M_y$) are given in column 4 of Table 13.2. As usual, we
square the discrepancies or deviations and find their mean as an indicator
of how great is their average. The squared discrepancies $(Y' - M_y)^2$ are
given in column 5 of Table 13.2. But before finding a mean of the squared
discrepancies, we weight each one for a column by the number of cases in
that column. The weighted, squared discrepancy for each column will be
found in the last column of Table 13.2. Then they are summed, and we
divide by N, which is 150 in this problem, to find $\sigma_{y'}^2$, which is 110.2997.
The square root of this is 10.50, which is the σ of the discrepancies.


TABLE 13.2. THE COMPUTATION OF A CORRELATION RATIO FOR THE REGRESSION
OF TIME SCORE ON CHRONOLOGICAL AGE

 (1)     (2)     (3)       (4)          (5)              (6)
  X      n_c     Y'      Y' - M_y    (Y' - M_y)²     n_c(Y' - M_y)²
 14      10     11.0      -12.0        144.00          1,440.00
 13      15     14.0       -9.0         81.00          1,215.00
 12      12     14.5       -8.5         72.25            867.00
 11      19     16.0       -7.0         49.00            931.00
 10      18     18.1       -4.9         24.01            432.18
  9      21     20.8       -2.2          4.84            101.64
  8      18     25.1       +2.1          4.41             79.38
  7      15     31.3       +8.3         68.89          1,033.35
  6      13     40.5      +17.5        306.25          3,981.25
  5       9     49.8      +26.8        718.24          6,464.16
Sums    150                                           16,544.96 = Σn_c(Y' - M_y)²
                                                        110.2997 = σ²_y'
                                                           10.50 = σ_y'


Remember that these are not the discrepancies of the observed points
from the predicted Y values, for the larger these are, the lower our
correlation. We are here interested in the size of discrepancies between predicted
Y values and the mean of all Y values, and the larger these are, the higher
our correlation. When the correlation is perfect, $\sigma_{y'}$ is as large as $\sigma_y$, and
the ratio $\sigma_{y'}/\sigma_y$ equals 1.00. When $\sigma_{y'} = 0$, the ratio equals zero. In
this problem, $\sigma_y = 12.535$. The correlation ratio is therefore

$$\eta_{yx} = \frac{10.50}{12.535} = .838$$

The steps in computing a correlation ratio may be summarized as follows.


Step 1. Determine the mean of all the Y values and also their standard
deviation.

Step 2. Determine the means of the columns (Y').

Step 3. Determine the discrepancies between Y' and $M_y$.

Step 4. Square the discrepancies.

Step 5. Multiply each squared discrepancy by the number of the cases
in the column ($n_c$).

Step 6. Sum the weighted, squared discrepancies, and divide by N. This
gives $\sigma_{y'}^2$. From this, find $\sigma_{y'}$.

Step 7. Solve the ratio $\sigma_{y'}/\sigma_y$, which is $\eta_{yx}$.


Remember that, for finding $\eta_{xy}$, we are dealing with rows rather than columns,
and so the steps will be the same except for the substitution of the word
row for the word column in the foregoing steps and the substitution of X for Y.
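The seven steps translate directly into code. The sketch below takes the column frequencies and column means from Table 13.2 and the overall mean (23.0) and standard deviation (12.535) from the text; the variable names are illustrative.

```python
from math import sqrt

# Columns of Table 13.2, ages 14 down to 5.
n_c = [10, 15, 12, 19, 18, 21, 18, 15, 13, 9]                         # cases per column
col_means = [11.0, 14.0, 14.5, 16.0, 18.1, 20.8, 25.1, 31.3, 40.5, 49.8]  # Y' values
mean_y = 23.0       # Step 1: mean of all Y values, from the text
sigma_y = 12.535    # Step 1: standard deviation of all Y values, from the text

n_total = sum(n_c)

# Steps 3 to 5: weighted, squared discrepancies of column means from the overall mean.
weighted_sq = [n * (m - mean_y) ** 2 for n, m in zip(n_c, col_means)]

# Step 6: variance and standard deviation of the predicted values.
var_pred = sum(weighted_sq) / n_total
sigma_pred = sqrt(var_pred)

# Step 7: the correlation ratio for the regression of Y on X.
eta_yx = sigma_pred / sigma_y

print(round(sum(weighted_sq), 2), round(var_pred, 4), round(eta_yx, 3))
```

For the other correlation ratio the same code would be run on row frequencies and row means of X.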

The Standard Error of a Correlation Ratio. The reliability of a
correlation ratio, like the reliability of r, is given by its standard error, and
this is derived by a similar formula

$$\sigma_\eta = \frac{1 - \eta^2}{\sqrt{N - 1}} \quad \text{(Standard error of a correlation ratio)} \tag{13.5}$$

The standard error of the eta coefficient that we have just obtained is
.025. The amount of correlation is therefore rather close to the population
correlation.

The Standard Error of Estimate in a Nonlinear Regression. The standard
error of estimate here can be computed as from a Pearson r [see formulas
(15.16a) and (15.16b)], but it can also be obtained from the knowledge that

$$\sigma_{yx}^2 + \sigma_{y'}^2 = \sigma_y^2$$

That is, the total variance in the Y distribution is made up of two
components, the variance predictable from X (this is $\sigma_{y'}^2$) and the variance
not predictable from X (which is $\sigma_{yx}^2$). Transposing, we have

$$\sigma_{yx}^2 = \sigma_y^2 - \sigma_{y'}^2$$

In solving for an eta coefficient, we must know both the terms on the
right of this equation. For our illustrative problem, they are 157.1262 and
110.2997, respectively. The difference is 46.8265, which is the nonpredicted
variance. The square root of this, which is 6.84, gives us $\sigma_{yx}$. The standard
error of estimate tells us how much dispersion there is of the obtained values
(Y values in this case) around the predicted values (Y' values in this case).
The figure 6.84 tells us that two-thirds of the time scores in the Form Board
test may be expected to be within 6.84 units of the predicted values, when
the predicted values are the means of the columns of the scatter diagram.
Such an estimate is useful, however, only when the variances within columns
are fairly uniform.
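The partition of variance just described can be checked in three lines. The two variances are those quoted in the text; the names are illustrative.

```python
from math import sqrt

var_total = 157.1262      # total variance of Y, from the text
var_predicted = 110.2997  # variance of the column means (predictable from X)

# Variance not predictable from X, and its square root, the standard error of estimate.
var_residual = var_total - var_predicted
sigma_yx = sqrt(var_residual)

print(round(var_residual, 4), round(sigma_yx, 2))
```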

The Relation of the Correlation Ratio to Analysis of Variance. Those 
who have read Chap. 12 will find much that is familiar in the preceding 
paragraphs. Regarding the successive columns of data, which are really 
the result of a one-way classification on a quantitative variable, namely, 
chronological age, as sets, we have all the information we need to proceed 
with an analysis-of-variance solution (see Table 13.3).

TABLE 13.3. AN ANALYSIS OF VARIANCE BASED UPON STATISTICS DERIVED IN THE
SOLUTION OF A CORRELATION RATIO

Component        Degrees of freedom    Sums of squares    Variances
Between sets              9               16,544.96        1,838.33
Within sets             140                7,023.97           50.17
Total                   149               23,568.93

F = 1,838.33 / 50.17 = 36.6

The sum 16,544.96
will be recognized as the sum of squares between sets, since it is based upon
the squared deviations of set means from the composite mean. The sum
7,023.97 will be recognized as the sum of squares within sets. This sum is
found most conveniently here from what we already know. It is given
by the product $N\sigma_{yx}^2$, which in this problem is 150 × 46.8265 = 7,023.97.
The sum of the two sums of squares makes up the total sum of squares for
the composite sample in variable Y. All we need next are the degrees of
freedom. For the between variance there are 9 (the number of sets minus
1). For the within variance there are 140 (N minus the number of sets).
The two estimates of the population variance are given in Table 13.3, also 
the F ratio, which is 36.6. Reference to Table F (Appendix B) shows that 
it is well above the F required for significance at the .01 level of confidence, 
which is about 2.5. 
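
As a rough check on Table 13.3, the sums of squares, variance estimates, and F ratio can be rebuilt from N, the number of sets, and the two variances already obtained. The sketch below assumes nothing beyond those quantities; the function name and dictionary keys are hypothetical.

```python
def anova_from_variances(n, k, total_variance, residual_variance):
    """Rebuild a one-way analysis-of-variance summary from variances.

    n: total number of cases; k: number of sets (columns).
    total_variance is the variance of Y; residual_variance is the
    variance not predictable from X.
    """
    ss_total = n * total_variance
    ss_within = n * residual_variance
    ss_between = ss_total - ss_within
    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between,
        "ss_within": ss_within,
        "ss_total": ss_total,
        "df": (df_between, df_within),
        "F": ms_between / ms_within,
    }

table = anova_from_variances(150, 10, 157.1262, 46.8265)
print(table["df"], round(table["F"], 1))
```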

The relationship pointed out here is more of academic interest than of 
practical interest, for we already know that the eta coefficient was so high 
that there was little doubt of a law of relationship existing between chrono- 
logical age and test score. Furthermore, the eta coefficient tells us a fact, 
namely, the degree of relationship, which an F ratio does not
convey. When the eta is near the lower margin of significance and a more
rigorous test of significance is required, when a decision is to be made as to 
whether or not there is any genuine relationship at all, then the F test has 
its advantages. Even then, however, an F test is not recommended unless 
Y is a monotonic (continuously increasing or decreasing) function of X. 

A Test of Linearity of Regression. Often the curvature in regression is 
so slight that we do not know but that it is merely a chance deviation from 
linearity. We therefore want some statistical test to show whether or not 
the curvature is probably real. Several tests of nonlinearity have been 
proposed. The test currently best accepted is an F test based upon an 
analysis-of-variance approach. The computation of F in this instance is 
simple, requiring only the knowledge of eta and the Pearson r for the same 
scatter plot, and the degrees of freedom. The formula is

$$F = \frac{(\eta^2 - r^2)(N - k)}{(1 - \eta^2)(k - 2)} \qquad \text{(F test of linearity)} \quad (13.6)$$

where k = number of columns (or rows). For the problem in recent
paragraphs, the Pearson r was found to be .763. By formula (13.6) we have

$$F = \frac{(.702244 - .582169)(150 - 10)}{(1 - .702244)(10 - 2)} = 7.06$$

In interpreting this F, the degrees of freedom are $(k - 2)$ and $(N - k)$.
Reference to Table F shows that the obtained F is significant well beyond
the .01 level. Thus, the difference between $\eta^2$ and $r^2$ is so great as to
leave little doubt of nonlinearity.

The hypothesis tested here is that the regression of Y on X is linear. 
In more exact terms, the hypothesis requires that the means of the columns 
all lie exactly on a straight line whose slope is determined by the Pearson r. 
Now if the actual form of regression were linear, sampling errors would 
cause the means of columns to deviate slightly from the best-fitting straight 
line. The sampling distribution in question is that of the deviations of the
actual means of the columns (the Y' values) from the regression line. These deviations
are ordinarily sufficient to make the eta coefficient larger than the Pearson 
r computed from the scatter diagram. The question is whether the devia- 
tions are large enough to suggest that there is something over and above 
these chance deviations involved. That is what the F test here is supposed 
to tell us. The F test should be applied to this particular use only when N
exceeds k considerably.
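
The F test of linearity needs only eta, r, N, and k, so it is easy to script. The following is a minimal sketch of formula (13.6); the function name is hypothetical, and the eta value of .838 is taken as the square root of the .702244 used in the worked computation.

```python
def linearity_f(eta, r, n, k):
    """F ratio for the hypothesis that the regression of Y on X is linear.

    Returns the F value and its degrees of freedom (k - 2, n - k).
    """
    eta_sq, r_sq = eta ** 2, r ** 2
    f_value = ((eta_sq - r_sq) * (n - k)) / ((1.0 - eta_sq) * (k - 2))
    return f_value, (k - 2, n - k)

f_value, dof = linearity_f(eta=0.838, r=0.763, n=150, k=10)
print(round(f_value, 2), dof)
```

The returned F is then referred to an F table with the two degrees of freedom shown.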

An Evaluation of the Correlation Ratio. The chief advantage and use 
of the eta coefficient has been indicated and illustrated—to determine the 
closeness of relationship between two variables when the regression is 
clearly nonlinear. Although very few nonlinear regressions have been
found in the correlation of measures of ability, it is likely that there are
many more such relationships in psychology and education than has been 
realized. This is true if we broaden our conception of the correlation 
problem considerably by saying that an index of correlation (index is a 
more inclusive term than coefficient) is a measure of the goodness of fit of 
obtained data to a regression line, whether it be straight or curved. The 
Pearson r indicates the goodness of fit of observed points to a straight line. 
Other indices, including eta, show the goodness of fit of data to other functions. 

Correlation Coefficients as Indices of Goodness of Fit. This broadening of 
the concept of correlation would bring into consideration curves of learn- 
ing and retention and many others. The eta coefficient assumes no par- 
ticular type of functional relationship between Y and X. The type of rela- 
tionship is defined by the actual, unsmoothed trend of the means of the 
columns (or rows). In this fact are both strength and weakness. Allowing 
the curvature of the regression to be as complex as the ups and downs in 
obtained class means make it, we find in eta the maximum size of correla- 
tion index for any set of data. 

We might assume some kind of mathematical function for the data
represented in Fig. 13.1, such as a hyperbola or parabola, a logarithmic function, or
some other. The goodness of fit, as indicated by a correlation index, would
probably not be so high for any of these functions as the eta coefficient
indicates. Because the eta coefficient does allow the regression curve to
follow the means of the columns, a certain amount of error or purely sampling 
variance undoubtedly gets into the deviations of column means from the 
general mean of the Y's, and hence the eta is a somewhat inflated figure. 
When the actual regression is linear, the difference between eta and r com- 
puted for the same data tells us about how much inflation has occurred. 
When the regression is nonlinear, we have less ready evidence as to how 
much inflation there is. We should therefore discount any eta a little, 
particularly if the means of sets do not follow a smooth trend rather well. 
The smaller the sample, the more irregular the trend of the set means is 
likely to be, and therefore the greater the proportion of inflation in eta. 

Examples of Nonlinear Regressions. In addition to the functional rela- 
tionships involved in learning and other phenomena, it is likely that when 
more is known about human traits that are not abilities (temperament,
interests, attitudes, and the like) and their interrelations, we shall find
many more examples of nonlinear regression. In the validation of test 
scores against vocational or other criteria of adjustment, more and more 
of such examples are coming to light. It has been known for some time that 
high "intelligence" may be just as bad prognostically as low "intelligence"
in connection with proficiency in routine and repetitive job assignments. 
This result will probably be found more general than has been supposed. 
The reason it has not been more widely recognized before is that relatively
short ranges of ability have been related to proficiency criteria. If the total
range, from lowest to the very highest, is studied in relation to proficiency
indices on various kinds of jobs (except those requiring highest abilities),
we may find the optimal ability to be somewhat short of the top in most
cases. This definitely means nonlinear regressions. 

A number of instances have been called to the writer's attention in which
scores on temperament tests bore a relation to rated proficiency in such a
way that the optimal position on the trait score was barely above average.
The application of the Pearson r method sometimes shows a near zero
correlation in such instances whereas an eta coefficient might be as high as
.30 or even .50. The straight line, in other words, was a very poor fit to
the regression of the data. This should stress the importance of plotting
scatter diagrams more frequently than is ordinarily done; otherwise
important nonlinear regressions may be overlooked. It is possible that many a
zero Pearson r reported in the literature conceals a significant nonlinear
relationship.

The Algebraic Sign of Eta. Some writers regard it as a weakness of eta 
that its algebraic sign is always positive. The algebraic sign of r is mean- 
ingful in that it shows whether the general trend is upward or downward. 
In defense of eta it may be said that it tells us the thing we are most interested 
in, the goodness of fit or closeness of relationship between two things. If 
the over-all trend is either upward or downward we can readily perceive 
that by inspection of the scatter plot, and we can attach whatever sign is 
appropriate if we wish to do so. Some curved regressions, for example, 
U-shaped or an inverted U-shaped type, may yield a significant eta without 
any general trend away from the horizontal. In this case no sign is mean- 
ingful for eta. 

Dependence of Eta upon the Number of Categories. A more serious weak- 
ness of eta is that its size depends upon the number of columns (or rows). 
The minimum number of classes that would show any curvature at all is 
three, but three might give a much-smoothed and distorted view of the real 
relationship. With too small a number of classes, therefore, we run the 
chance of obtaining an estimate of correlation that is too small. On the 
other hand, as we increase the number of classes, we make the means of 
the classes less stable, and, as they fluctuate more, chance errors become 
more important in inflating eta. The limiting case would be classes so 
small that there was only one observation per class (assuming no duplicate 
measures on X), in which case the variance in the columns would be just 
as great as the over-all variance in Y, and eta would equal 1.00.
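
The limiting case is easy to demonstrate. The sketch below computes eta as the square root of the between-classes sum of squares over the total sum of squares, then regroups the same scores with one observation per class. The data and the function name are hypothetical, made up purely for illustration.

```python
from statistics import fmean

def correlation_ratio(classes, y):
    """Eta of Y on a classification: sqrt(SS between classes / SS total)."""
    groups = {}
    for label, value in zip(classes, y):
        groups.setdefault(label, []).append(value)
    grand_mean = fmean(y)
    ss_total = sum((v - grand_mean) ** 2 for v in y)
    ss_between = sum(
        len(g) * (fmean(g) - grand_mean) ** 2 for g in groups.values()
    )
    return (ss_between / ss_total) ** 0.5

# Hypothetical Y scores, grouped two different ways on X
y = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 6.0, 10.0]
three_classes = [0, 0, 0, 1, 1, 1, 2, 2]
one_per_class = list(range(len(y)))

eta_coarse = correlation_ratio(three_classes, y)
eta_singletons = correlation_ratio(one_per_class, y)
print(round(eta_coarse, 3), round(eta_singletons, 3))
```

With one case per class every class mean coincides with its single score, so the between-classes sum of squares equals the total and eta reaches 1.00 regardless of any real relationship.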

Methods for correcting eta for number of classes have been proposed, 
but none can be recommended. The best rule would be to keep the classes 
large enough so that means of classes are fairly stable and fall rather smoothly 
into line in the scatter plot and yet to have enough classes to bring out
clearly the shape of the regression. The size of sample has some
bearing on this. The larger the sample, the larger the number of classes
that can be tolerated. Very small samples would be unsuitable for the
computation of eta at all. With large samples (100 and above) it is suggested
that the number of classes range between six and twelve.¹

The Use of Mathematical Functions. Better than the correlation-ratio 
approach, in research studies, would be an effort to establish the form of a 
regression as some mathematical function and then test the goodness of fit 
of data to that function by methods which we cannot go into here. There 
are other texts that treat this topic in some detail.²


THE BISERIAL COEFFICIENT OF CORRELATION


The biserial r is especially designed for the situation in which both of 
the variables correlated are really continuously measurable but one of the 
two is for some reason reduced to two categories. This reduction to two 
categories may be a consequence of the only way in which the data can 
be obtained, as, for example, when one variable is whether a student
passes or fails to pass a certain criterion of success. We can well
assume a continuum along which individuals differ with respect to the 
ability required to pass this criterion. Those having a degree of ability 
above a certain crucial point do pass it, and those having a degree of ability 
below that crucial point fail to pass. 

Let us assume that the criterion is graduation from pilot training. Al- 
though not all graduates are equal in achievement nor are all eliminees, 
all we know is whether each person belongs to one category or the other. 
It is as if our grouping were so coarse in this variable as to be confined to 
two class intervals rather than a dozen or so. If we are prepared to justify 
normality of distribution in this dichotomous variable, we have a formula 
by which a coefficient of correlation can be computed. 

¹ For small samples, a statistic known as epsilon (a correlation ratio without bias) is
recommended. See Peters, C. C., and Van Voorhis, W. R. Statistical Procedures and
Their Mathematical Bases. New York: McGraw-Hill, 1940. Pp. 319f.

² Deming, W. E. Statistical Adjustment of Data. New York: Wiley, 1946; Lewis, D.
Quantitative Methods in Psychology. Iowa City: The author, 1949.

Computation of a Biserial r. The principle upon which the formula for a
biserial r is based is that with zero correlation there would be no difference
between means, and the larger the difference between means, the larger the
correlation. The general formula for biserial r is

$$r_b = \frac{M_p - M_q}{\sigma_t} \times \frac{pq}{y} \qquad \text{(Biserial coefficient of correlation)} \quad (13.7)$$

where $M_p$ = mean of X values for the higher group in the dichotomous
              variable, the one having more of the ability on which the
              sample is divided into two subgroups
      $M_q$ = mean of X values for the lower group
      p = proportion of the cases in the higher group
      q = proportion of the cases in the lower group
      y = ordinate of the normal distribution curve with surface equal
          to 1.00, at the point of division between segments
          containing p and q proportions of the cases (see Fig. 13.2)
      $\sigma_t$ = standard deviation of the total sample in the continuously
          measured variable, X
FIG. 13.2. A normal distribution of the cases along the scale of ability to pass the course of
training. The area to the right of the ordinate shown represents the 65 per cent who
graduated, and the area to the left represents the 35 per cent who failed to graduate.
[Horizontal axis: below average ability to the left of 0, above average ability to the right.]

TABLE 13.4. DISTRIBUTION OF SCORES FOR TWO GROUPS OF STUDENTS, THOSE
PASSING AND THOSE FAILING; ALSO A COMBINED DISTRIBUTION

Table 13.4 presents typical data for computing a biserial correlation.
The passing group were distributed as shown; also, the failing group. The
proportions passing and failing are .65 and .35, respectively.¹ The y ordinate
(from Table C) is .3704. The distribution of the total group is assumed
to be as indicated in Fig. 13.2. The computation of the biserial r proceeds
as follows:

$$r_b = \frac{98.27 - 83.64}{17.68} \times \frac{(.65)(.35)}{.3704} = .508$$
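
The same computation can be sketched in Python, with the normal ordinate y obtained from the standard library rather than from a printed table. This is an illustrative sketch of formula (13.7) only; the function names are hypothetical, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

def normal_ordinate(p):
    """Height of the unit normal curve at the point cutting off proportion p."""
    z = NormalDist().inv_cdf(p)
    return NormalDist().pdf(z)

def biserial_r(mean_high, mean_low, sd_total, p):
    """Biserial r from the two group means, the total SD, and proportion p."""
    q = 1.0 - p
    y = normal_ordinate(p)
    return (mean_high - mean_low) / sd_total * (p * q) / y

y = normal_ordinate(0.65)
r_b = biserial_r(mean_high=98.27, mean_low=83.64, sd_total=17.68, p=0.65)
print(round(y, 4), round(r_b, 3))
```

Because the normal curve is symmetric, the ordinate is the same whether p or q is passed to `normal_ordinate`.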


Table G (Appendix B) is designed, in part, to supply several of the
constants needed in the computation of a biserial r, either by formula (13.7) or by
formula (13.9), and the computation of its standard error. For given values
of p, Table G supplies the corresponding values of $pq/y$, $p/y$, and $\sqrt{pq}/y$.

¹ It is good practice to compute p and q each to three significant digits.




The Standard Error of $r_b$. The standard error of a biserial r is estimated
by the formula

$$\sigma_{r_b} = \frac{\dfrac{\sqrt{pq}}{y} - r_b^2}{\sqrt{N}} \qquad \text{(Standard error of a biserial r)} \quad (13.8)$$

where the symbols have already been defined above.
In this problem

$$\sigma_{r_b} = \frac{\dfrac{.4770}{.3704} - .508^2}{\sqrt{200}} = .073$$

This standard error may be interpreted as usual, and we find that the
obtained $r_b$ is so large that it undoubtedly did not arise from an
uncorrelated population.
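
A short sketch of formula (13.8) follows, using the values of the illustrative problem (N = 200, as in the computation above). The function name is hypothetical.

```python
import math
from statistics import NormalDist

def biserial_standard_error(r_b, p, n):
    """Standard error of a biserial r: (sqrt(pq)/y - r_b^2) / sqrt(N)."""
    q = 1.0 - p
    y = NormalDist().pdf(NormalDist().inv_cdf(p))  # normal ordinate at the cut
    return (math.sqrt(p * q) / y - r_b ** 2) / math.sqrt(n)

se = biserial_standard_error(r_b=0.508, p=0.65, n=200)
print(round(se, 3))
```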


Alternative Formula for Biserial r. In many situations, a more
convenient formula for the biserial r is¹

$$r_b = \frac{M_p - M_t}{\sigma_t} \times \frac{p}{y} \qquad \text{(Alternative formula for a biserial r)} \quad (13.9)$$

where the only new symbol is $M_t$, the mean of the total sample. The greater
convenience of this formula over the other is that formula (13.9) gives us
one less distribution to deal with. A good type of work sheet for solution
by this formula is shown in Table 13.5. It is convenient to use the same
zero point for both the component distribution and for the total
distribution. By this procedure, the biserial r and its $\sigma_{r_b}$ come out the same,
as we have already seen.

TABLE 13.5. SOLUTION OF MEANS AND STANDARD DEVIATION NECESSARY FOR THE
COMPUTATION OF A BISERIAL r

$M'_p$ = +.377        $M'_t$ = -.135        $\sigma_t = 10\sqrt{3.1450 - .135^2}$
$iM'_p$ = +3.77       $iM'_t$ = -1.35                 $= 10\sqrt{3.1268}$
$M_p$ = 98.27         $M_t$ = 93.15                   $= 17.68$

¹ Dunlap, J. W. Note on computation of biserial correlations in item evaluation.
Psychometrika, 1936, 1, 51-60.

An Evaluation of the Biserial r. Since the biserial coefficient of
correlation is a product-moment r and is designed to be a good estimate of the
Pearson r, the same requirements as for the latter must be satisfied (linear
regression and homoscedasticity), plus the unique requirement that the
distribution of the values on the dichotomous variable, when continuously
measured, shall be normal. This requirement of normality applies to the 
form of population distribution. Even if the sample distribution is not 
normal, the population distribution may still be normal. 

The use of the quantities р, q, and y in formulas (13.7) and (13.9) directly 
implies the normal distribution of the dichotomized variable. Departures 
from normality, if marked, will often lead to very erroneous estimates of 
correlation. With bimodal distributions, for example, it is possible that $r_b$
will prove to exceed 1.0. Bimodal and other nonnormal distributions are 
most likely to occur in heterogeneous samples—for example, in variables in 
which there is a significant sex difference and both sexes are included in a 
sample. 

When to Dichotomize Distributions. The biserial r is very useful, in fact
it is sometimes essential, and when properly used is a very good substitute
for the Pearson r. There are instances in which the Y variable has been
continuously measured, but there are irregularities that preclude computing
a good estimate of the Pearson r. In such cases the biserial r may be brought
into service. One example of this would be a truncated distribution; another
would be when there are very few categories for the Y variable and it is
doubtful whether they are equidistant on a metric scale; another would
be in the case of a badly skewed distribution in Y values owing to a defective
measuring instrument.

Before computing $r_b$, we would, of course, need to dichotomize each Y
distribution. In this we would have some choice, and it would be well to 
make the division point as near the median as possible. The reason for 
this will be made clear in the next paragraph. In all these peculiar instances, 
however, we are not relieved of the responsibility for defending the assump- 
tion of the normal distribution of Y. It may seem contradictory to suggest 
that when the Y distribution is skewed we resort to the biserial r, but note
that it is the sample distribution that is skewed and it is the population dis- 
tribution that must be assumed to be normal. 

Biserial r Is Less Reliable Than the Pearson r. Whenever there is a real
choice of computing a Pearson r versus a biserial r, however, one should
favor the former, unless the sample is very large and computation
time is an important factor. The standard error for a biserial r is quite a
bit larger than that for a Pearson r derived from the same sample. If we
compare the two formulas for the standard errors, formulas (9.12) and
(13.8), we find that the only real difference is in the numerators. One
reads $1 - r^2$ and the other reads $\sqrt{pq}/y - r_b^2$. If we examine the $\sqrt{pq}/y$
values in Table G, we find that even when this value is smallest (and that
is when p = q = .5), it is about 25 per cent larger than 1. When $r_b$ = .00,
the standard error of $r_b$ is therefore at least 25 per cent larger than that for r
for the same size of sample. As p approaches 1.0 or 0.0, the ratio $\sqrt{pq}/y$
becomes larger until, when p = .94, it is as large as 2. This is why in the
preceding paragraph it was recommended that dichotomies have the division
point as near the median as possible. It also suggests that we need larger
samples for the same dependability of $r_b$ than for r and that we should hesitate
to compute $r_b$ for very one-sided divisions of cases unless the sample is
extremely large. This is reasonable from another point of view. Remember
that prominent in the formula for $r_b$ is the difference between means. This
difference is not very stable unless each mean comes from a sample of
sufficient size. Even if the sample totaled 1,000 cases, if only 1 per cent of the
cases were in one of the two categories, its mean would be based upon only
10 cases. This is not favorable to reliable estimates based upon such a mean.
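
To see how quickly the penalty grows as the division point moves away from the median, the ratio $\sqrt{pq}/y$ can be tabulated directly instead of being read from Table G. The sketch below is illustrative only; the function name `se_inflation` is hypothetical.

```python
import math
from statistics import NormalDist

def se_inflation(p):
    """sqrt(pq)/y: numerator of the biserial standard error when r_b = 0,
    relative to the numerator 1 that applies to a Pearson r of zero."""
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return math.sqrt(p * (1.0 - p)) / y

for p in (0.50, 0.65, 0.80, 0.94, 0.99):
    print(f"p = {p:.2f}   sqrt(pq)/y = {se_inflation(p):.2f}")
```

The smallest value occurs at p = .50, which is the numerical reason for dividing as near the median as possible.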

Other Serial Correlations. Formulas have recently been developed by 
Jaspen for the correlation of a continuous variable with another variable 
that has been artificially classified in three, four, or five categories.¹ Owing
to the rareness of the need for such formulas, space will not be taken to
present them here. If one has more than two categories, he can always
combine certain ones to make two and then compute $r_b$, provided, of course,
that the necessary assumptions are satisfied. 


POINT-BISERIAL CORRELATION 


When one of the two variables in a correlation problem is a genuine 
dichotomy, the appropriate type of coefficient to use is the point-biserial 
r. Examples of genuine dichotomies are male versus female, being a farmer 
versus not being a farmer, owning a home versus not owning one, living versus 
dying, or living in Boston versus not living in Boston, and so on. Bimodal
or other peculiar distributions, although not representing entirely discrete
categories, are sufficiently discontinuous to call for the point-biserial rather
than the biserial r. Examples of this type are color blindness versus normal
color vision; being alcoholic versus nonalcoholic; and criminal versus 
noncriminal. 

¹ Jaspen, N. Serial correlation. Psychometrika, 1946, 11, 23-30.

There are other variables which, though not fundamentally dichotomous
(they may even be normally distributed), we have to treat as if they
were genuine dichotomies in practical operations. An outstanding example
of this is a test item that is scored as either right or wrong. No doubt those
who answer the item correctly are not all equally capable in the ability or
abilities measured by the item. A total test score would provide continuous
gradations in ability levels. In testing practice, however, the kind of item
described is limited to separating individuals into two groups, and only
gross predictions can be made from responses to it. Such a variable is a
good example to explain the basic nature of the point-biserial r. If we gave
a "score" of +1 to each person with a correct answer and a "score" of zero
to each person with a wrong answer, in the item variable we should have only
two class intervals and we treat them as if they were genuine categories. A
product-moment r could be computed with Pearson's basic formula. The
result would be a point-biserial r.

A special formula is provided, however, which does not resemble the basic
Pearson formula. It reads¹

$$r_{pbi} = \frac{M_p - M_q}{\sigma_t}\sqrt{pq} \qquad \text{(Point-biserial coefficient of correlation)} \quad (13.10)$$

where the symbols are defined just as they were in the formula for the
ordinary biserial r (formula 13.7). The only difference between this formula
and the one for the ordinary biserial r is that the numerator contains $\sqrt{pq}$
rather than pq, and the constant y is missing from the denominator. For
the same set of data, then, the ordinary biserial r would be $\sqrt{pq}/y$ times as
large as $r_{pbi}$. In this ratio lies a feature of $r_{pbi}$ to which we shall return soon.

Let us apply formula (13.10) to some data on the relation of body weight 
to sex membership. In a sample of 51 sixteen-year-old high-school students, 
of whom 24 were male and 27 were female, the mean weights in kilograms 
were 67.8 and 56.6, respectively. The proportion of males, p, is accordingly
24/51 = .471 and q is .529. The standard deviation of the combined
distributions is 13.2. Solving with formula (13.10),

$$r_{pbi} = \frac{67.8 - 56.6}{13.2}\sqrt{(.471)(.529)} = .42$$


The correlation between sex and body weight for sixteen-year-old high- 
school students is estimated to be .42. 
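
The point-biserial formula can be sketched in Python and checked against the body-weight figures. The second half of the sketch uses a small set of made-up item and test scores (hypothetical data, not from the text) to illustrate that the special formula agrees with the ordinary product-moment r computed on 0/1 scores, provided the standard deviation of the total sample is computed with N in the denominator.

```python
import math
from statistics import fmean, pstdev

def point_biserial_r(mean_p, mean_q, sd_total, p):
    """Point-biserial r from group means, total SD, and proportion p."""
    return (mean_p - mean_q) / sd_total * math.sqrt(p * (1.0 - p))

def pearson_r(x, y):
    """Ordinary product-moment correlation."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Body-weight example from the text: 24 males, 27 females
r_weight = point_biserial_r(67.8, 56.6, 13.2, 24 / 51)

# Hypothetical item (1 = right, 0 = wrong) and total-score data
item = [1, 1, 1, 0, 0, 1, 0, 1]
score = [12.0, 15.0, 11.0, 9.0, 7.0, 14.0, 10.0, 13.0]
passed = [s for s, i in zip(score, item) if i == 1]
failed = [s for s, i in zip(score, item) if i == 0]
from_formula = point_biserial_r(
    fmean(passed), fmean(failed), pstdev(score), len(passed) / len(score)
)
from_pearson = pearson_r(item, score)
print(round(r_weight, 2), round(from_formula, 4), round(from_pearson, 4))
```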

¹ For a derivation of this formula, also formula (13.11), see Appendix A.

Significance of a Point-biserial r. The hypothesis of zero correlation
for the point-biserial r can be tested in two ways. Since $r_{pbi}$ depends directly
upon the difference between the means $M_p$ and $M_q$, a significant departure
from a mean difference of zero also indicates a significant correlation. A t
test of the difference between means can therefore be used to test the
significance of the departure of the correlation coefficient from zero.

A direct t test of the correlation coefficient can also be made, but only
for the hypothesis of a correlation of zero. The t ratio can be computed
for $r_{pbi}$ in the same manner as for a Pearson product-moment r [see formula
(10.3)] and the interpretation can be made with reference to Student's
distribution.¹ For the illustrative problem, in which $r_{pbi}$ = .42 and N = 51,
t is equal to 3.24, which indicates a correlation significant beyond the .01
level. Table D may also be used to determine whether an obtained $r_{pbi}$ is
significant.
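
The t ratio quoted above can be reproduced with a short sketch. It assumes that formula (10.3), which is not shown in this section, is the usual t = r√(N − 2)/√(1 − r²) with N − 2 degrees of freedom; that assumption does reproduce the 3.24 given in the text. The function name is hypothetical.

```python
import math

def t_for_zero_correlation(r, n):
    """t ratio and degrees of freedom for testing a correlation against zero.

    Assumes the usual form t = r * sqrt(n - 2) / sqrt(1 - r^2).
    """
    t = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    return t, n - 2

t_value, df = t_for_zero_correlation(r=0.42, n=51)
print(round(t_value, 2), df)
```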

When the population value of $r_{pbi}$ is not zero, the mean of the t distribution
is not zero; hence the determination of confidence limits for any obtained
$r_{pbi}$ is not a simple matter.²

Alternative Methods of Computation for $r_{pbi}$. As for the ordinary biserial
r, there is an alternative formula for computing $r_{pbi}$ which may be more
convenient in many situations. It reads

$$r_{pbi} = \frac{M_p - M_t}{\sigma_t}\sqrt{\frac{p}{q}} \qquad \text{(Alternative formula for the point-biserial correlation coefficient)} \quad (13.11)$$

Formulas for $r_{pbi}$ making unnecessary the computation of p and q are

$$r_{pbi} = \frac{(M_p - M_q)\sqrt{N_p N_q}}{N\sigma_t} \qquad \text{(Alternative formulas for the point-biserial r)} \quad (13.12)$$


and $$r_{pbi} = \frac{M_p - M_t}{\sigma_t}\sqrt{\frac{N_p}{N_q}} \quad (13.13)$$


where $N_p$ and $N_q$ are the frequencies in the two categories.

An Evaluation of the Point-biserial r. Since the $r_{pbi}$ coefficient is not
restricted to normal distributions in the dichotomous variable, it is much
more generally applicable than is $r_b$. Whenever there is doubt about
computing $r_b$, the point-biserial r will serve. For this reason, it should probably
be used more than it is. Although a product-moment r, in value it is rarely
comparable numerically with a Pearson r, or even with an ordinary biserial 
r, when computed from the same data. This is its greatest weakness as a 
descriptive statistic. Under special circumstances, to be described, it may 
be used as a basis for making an estimate of the Pearson r. 

Relation of $r_{pbi}$ to $r_b$. When properly applied, $r_b$ gives coefficients that
are generally good approximations to Pearson r's that could be computed
from the same data had both variables been continuously measured.
Consequently, all the usual interpretations that are made of r (see Chap. 15)
can also be made of $r_b$.

¹ Perry, N. C., and Michael, W. B. The reliability of a point-biserial coefficient of
correlation. Psychometrika, 1954, 16, 313-325.

² For methods of estimating confidence limits in this situation, see Walker, H. M., and
Lev, J., Statistical Inference. New York: Holt, 1953. P. 266; also Perry, N. C., and
Michael, W. B. A tabulation of the fiducial limits for the point-biserial correlation
coefficient. Educ. psychol. Measmt., 1954, 14, 715-721.

If $r_{pbi}$ were computed from data that actually justified the use of $r_b$,
however, the coefficient computed would be markedly smaller than $r_b$ obtained
from the same data. Even if the one variable is actually continuous but
not normally distributed, in which case we might better utilize $r_{pbi}$, the latter
would give an underestimate of the amount of correlation. As was pointed
out before, $r_b$ is $\sqrt{pq}/y$ times as large as $r_{pbi}$ when they are computed from
the same basic data. This ratio varies from about 1.25 when p = .50 to
about 3.73 when p (or q) equals .99 (see Table G). Figure 13.3 shows
graphically the ratio of $r_{pbi}$ to $r_b$ for various values of p. The ratio of $r_{pbi}$
to $r_b$ is, of course, the reciprocal of the ratio of $r_b$ to $r_{pbi}$; in other words, it is
$y/\sqrt{pq}$. The diagram is designed in this manner to show maximum values
of $r_{pbi}$ that would arise from continuous, normal distributions. In terms of
formulas,

$$r_b = r_{pbi}\,\frac{\sqrt{pq}}{y} \quad (13.14a)$$

$$r_{pbi} = r_b\,\frac{y}{\sqrt{pq}} \quad (13.14b)$$

(Conversion of one biserial r into the other when normality of distribution exists)

FIG. 13.3. Ratio of the point-biserial r to the biserial r when the difference between means
($M_p - M_q$) and the standard deviation ($\sigma_t$) are constant and the proportion in the larger
category (p) varies. [Vertical axis: ratio of point-biserial r to biserial r. Horizontal axis:
p, proportion in larger category, from 0.50 to 1.00.]

It is recommended that when the dichotomous variable is normally
distributed without much doubt, $r_b$ be computed and so interpreted. If there
is little doubt that the distribution is a genuine dichotomy, $r_{pbi}$ should be
computed and so interpreted. For the doubtful situations, the $r_{pbi}$ should
be computed but interpreted in the light of Fig. 13.3. That is to say, if the
distribution in question is continuous but not normal, and if $r_{pbi}$ approaches
the limit described by Fig. 13.3, we can say that the genuine correlation
approaches 1.00 more closely than the obtained $r_{pbi}$ does. If the obtained
$r_{pbi}$ should exceed the limit, for the size of p involved, it probably means
that the assumption of a genuine dichotomy is the correct one. In other
words, when there is a point distribution, $r_{pbi}$ can approach 1.00. Many
distributions are in the doubtful class; they are neither dichotomous nor
continuous. At least, if they are continuous, they may not be unimodal.
It is to help take care of these twilight instances that Fig. 13.3 was designed.

If it develops after we have computed $r_{pbi}$ that the situation justifies the use
of $r_b$, we can convert the obtained $r_{pbi}$ to the appropriate $r_b$ by means of
formula (13.14a). If we have computed $r_b$ when it later develops that we
should have used $r_{pbi}$, formula (13.14b) will provide the proper transformation.
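
The two conversions, and the limit that Fig. 13.3 plots, can be sketched as follows. The function names and the example values (.40 and p = .50) are hypothetical; the conversions themselves assume, as the text requires, that the dichotomized variable is normally distributed.

```python
import math
from statistics import NormalDist

def ratio_b_to_pbi(p):
    """sqrt(pq)/y, the factor by which r_b exceeds r_pbi for a given p."""
    y = NormalDist().pdf(NormalDist().inv_cdf(p))
    return math.sqrt(p * (1.0 - p)) / y

def biserial_from_point_biserial(r_pbi, p):
    """Convert r_pbi to r_b by multiplying by sqrt(pq)/y."""
    return r_pbi * ratio_b_to_pbi(p)

def point_biserial_from_biserial(r_b, p):
    """Convert r_b to r_pbi by multiplying by y/sqrt(pq)."""
    return r_b / ratio_b_to_pbi(p)

def point_biserial_limit(p):
    """Largest r_pbi expected from a continuous, normal variable (r_b = 1)."""
    return 1.0 / ratio_b_to_pbi(p)

# Hypothetical example: an obtained r_pbi of .40 with an even split
r_b = biserial_from_point_biserial(0.40, 0.50)
print(round(r_b, 3), round(point_biserial_limit(0.50), 3))
```

An obtained point-biserial coefficient close to the value returned by `point_biserial_limit` suggests that the underlying correlation is close to 1.00.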


TETRACHORIC CORRELATION 


A tetrachoric r is computed from data in which both X and Y have been 
reduced artificially to two categories. Under the appropriate conditions 
it gives a coefficient that is numerically equivalent to a Pearson r and may be 
regarded as an approximation to it. It is sometimes the only way of estimat- 
ing the correlation between two variables because the data could not be 
obtained in graded quantities. It is sometimes a quick and convenient
method of estimating r from data that are in the form of continuous
measurements, when time is an important consideration and the sample is large.

Assumptions Underlying the Tetrachoric r. The tetrachoric r requires
that both X and Y be continuously variable, normally distributed, and
linearly related. A problem in which the tetrachoric r may be computed
is illustrated in Table 13.6, if we are willing to make the necessary
assumptions. These data represent the numbers of students responding "Yes"
and "No" to two questions in a personality questionnaire. Question I
was, "Do you enjoy getting acquainted with most people?" and
question II was, "Do you prefer to work with others rather than alone?" Out
of 930 replies to both questions, we have the numbers who responded similarly
(cells a and d in Table 13.6) and the number who responded differently to
the two questions (cells b and c). It is obvious that in the case of a perfect
positive correlation, all the cases would fall in cells a and d. In a perfect
negative correlation, they would fall in cells b and c. In a zero correlation,
the frequencies would be proportionately distributed in the four cells.¹

TABLE 13.6. FOURFOLD TABLE FROM WHICH A TETRACHORIC COEFFICIENT OF
CORRELATION IS COMPUTED

                                Question I
                       Yes          No          Total    Proportion
Question II   Yes      374 (a)      167 (b)      541     .582 (p)
              No       186 (c)      203 (d)      389     .418 (q)
              Total    560          370          930     1.000
         Proportion    .602 (p')    .398 (q')    1.000

The assumption of continuity and normality of distribution can be defended 
as follows: It is unlikely that all who respond “Yes” to either question do 
so with equal degree of affirmation. It is similarly unlikely that those who 
respond “No” do so with equal degree of negation. It is most likely that 
either question represents a continuum of behavior extending from strong 
affirmation at the one extreme to strong negation at the other. Continuity 
is thus the probable state of affairs, not a real dichotomy. If a continuum 
is granted, the general law of unimodal distribution approaching normality 
in psychological traits may be cited in defense of the other requirement. By 
making the necessary assumptions, at any rate, many things can be done 
with such data that would otherwise be impossible. As in most statistical 
operations where true form of distribution is unknown, we can here remember 
that we have taken the chance of faulty assumptions and interpret results 
with the requisite reservation. 

The Equation for the Tetrachoric r. The complete equation for the 
tetrachoric r is a long and complicated one, involving a series including 
many powers of r. With the first few terms included, it reads 

$$ r_t + r_t^2 \, \frac{z z'}{2} + r_t^3 \, \frac{(z^2 - 1)(z'^2 - 1)}{6} + \cdots = \frac{ad - bc}{y y' N^2} \qquad (13.15) $$

The symbols will be explained with reference to Table 13.6. The letters 
a, b, c, and d refer to the frequencies in the four cells of the fourfold table, 
and N to their sum. r_t is given the subscript to indicate that it is a 
tetrachoric r. Numerically, it approximates a Pearson r. 

In Table 13.6, it will be noted that the distribution of responses to ques- 
tion I is given in terms of proportions p' and q'. The distribution of all 
responses to question II is similarly given in terms of p and q. These propor- 
tions are required for finding the values for the y's and z's in formula (13.15). 
The symbols z and z' stand for the standard measurements on the base 
line of the unit normal distribution curve at the points of division of cases 
in the two distributions. The symbols y and y' are ordinates corresponding 
to z and z' in the unit normal distribution. 

* It will be noted that the categories for X are in an unusual order (positive, or "good," 
end toward the left), which makes the regression "line" slope downward to the right for a 
positive correlation. For some reason, tradition has kept to this arrangement. Other 
2 X 2 tables reverse this order, in keeping with the usual scatter diagram. Then the 
letters a and b, also c and d, are reversed. Letters a and d always stand for like-signed 
cases in this volume. 
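
As a rough illustration of why formula (13.15) is laborious, the following Python sketch solves the series truncated to its first three terms by bisection. The function name, the truncation, and the use of bisection are assumptions made for illustration only; the sketch also assumes the truncated series increases with r over the interval from -1 to +1, which holds when both splits are near the median.

```python
from statistics import NormalDist


def tetrachoric_three_terms(a, b, c, d, tol=1e-10):
    """Solve the three-term truncation of the tetrachoric series for r_t."""
    n = a + b + c + d
    unit_normal = NormalDist()
    p = (a + b) / n          # proportion in the first row category
    p_prime = (a + c) / n    # proportion in the first column category
    z = unit_normal.inv_cdf(p)              # base-line point of division
    z_prime = unit_normal.inv_cdf(p_prime)
    y = unit_normal.pdf(z)                  # ordinate at that point
    y_prime = unit_normal.pdf(z_prime)
    target = (a * d - b * c) / (y * y_prime * n ** 2)

    def series(r):
        return (r
                + r ** 2 * z * z_prime / 2
                + r ** 3 * (z ** 2 - 1) * (z_prime ** 2 - 1) / 6)

    low, high = -1.0, 1.0
    while high - low > tol:   # bisection on the truncated series
        middle = (low + high) / 2
        if series(middle) < target:
            low = middle
        else:
            high = middle
    return (low + high) / 2


# Frequencies a, b, c, d of Table 13.6
r_t = tetrachoric_three_terms(374, 167, 186, 203)
print(round(r_t, 2))
```

With only three terms the result is itself an approximation; for the data of Table 13.6 it comes out close to .34.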

Methods of Estimating the Tetrachoric r. The solution for r_t by means 
of formula (13.15) is a formidable task and can be only an approximation, 
at best. Consequently, numerous short-cut methods have been devised for 
estimating it. Some of these will now be described. 

The Cosine-pi Formula. One approximation formula for r_t is known as 
the cosine-pi formula. In mathematical form, 

$$ r_{\cos\text{-}\pi} = \cos\left( \frac{\pi \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) $$

Since for computing purposes π can be taken to be 180 deg., the practical 
form of the equation is 

$$ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}} \right) \qquad (13.16) $$
(Cosine-pi approximation to a tetrachoric r) 

By dividing numerator and denominator by √(bc), we have a formula that 
is more convenient for computing purposes. It reads 

$$ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ}{1 + \sqrt{ad/bc}} \right) \qquad (13.17) $$
[Formula (13.16) in simpler form] 

where a, b, c, and d are the frequencies as defined in Table 13.6. 

It is well to remember that b and c represent the unlike-signed cases and 
a and d the like-signed cases. When numbers are substituted, the expression 
within the parentheses reduces to a single number, which is an angle in 
terms of degrees of arc. The cosine of this angle is the estimate of r_t. The 
angle will vary from zero, when either b or c is zero, or both, to 180 deg., 
when either a or d is zero, or both. In the first case, when the angle is zero, 
the correlation is +1.00, and in the second case, when the angle is 180 deg., 
it is -1.00. When the product bc equals ad, the angle is 90 deg., the cosine 
of which is zero, and r_t is estimated to be .0. 

Applying the cosine-pi formula to the data in Table 13.6, we have 

$$ r_{\cos\text{-}\pi} = \cos\left( \frac{180^\circ}{1 + \sqrt{\dfrac{(374)(203)}{(167)(186)}}} \right) = \cos 70.24^\circ = .343 $$

The cosine of an angle of 70.24 deg. (as found by interpolating in Table J, 
Appendix B) is .343. 

In this method, if the angle should prove to be between 90 and 180 deg., 
the correlation is negative. This can be anticipated by noting that the 
product bc is greater than ad. Angles over 90 deg. are not listed in Table J. 
For an angle between 90 and 180 deg., deduct the angle from 180 deg., find 
the cosine of this difference, and give it a negative sign. 

Table M in Appendix B provides a quick solution for r_cos-pi to two decimal 
places. Only the ratio ad/bc (or its reciprocal bc/ad) need be known; com- 
pute whichever gives a value greater than 1.0. For the illustrative problem 
above, ad/bc equals 2.444. This lies between the given ratios 2.421 and 
2.490, which indicates a correlation of .34. 
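
The cosine-pi computation can be sketched in a few lines of Python. The function name is hypothetical. The form of formula (13.16) is used rather than (13.17) so that a zero in b or c does not cause a division by zero; angles above 90 deg. need no special handling here because the library cosine already returns a negative value for them.

```python
import math


def cos_pi_r(a, b, c, d):
    """Cosine-pi approximation to a tetrachoric r, after formula (13.16)."""
    root_ad = math.sqrt(a * d)   # like-signed cells
    root_bc = math.sqrt(b * c)   # unlike-signed cells
    if root_ad + root_bc == 0:
        raise ValueError("ad and bc cannot both be zero")
    angle_degrees = 180.0 * root_bc / (root_ad + root_bc)
    return math.cos(math.radians(angle_degrees))


# Frequencies of Table 13.6
print(round(cos_pi_r(374, 167, 186, 203), 2))
```

Carried out at full precision, the computation for Table 13.6 rounds to .34, in agreement with the reading from Table M.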

Limitations to the Use of the Cosine-pi Formula. It should be pointed out 
that formula (13.17) gives a very close approximation to r_t only when both 
variables X and Y are dichotomized at their medians. As p and p' depart 
from .5, as p and p' differ from each other increasingly, and as r_t becomes very 
large, r_cos-pi departs more and more from r_t and is systematically larger than 
r_t.¹ For example, if p = [illegible] and p' = .84, when r_t is .79, r_cos-pi is approximately 
.90. If both p and p' are within the limits of .4 to .6, however, when r_t is .50 
the maximum discrepancy is approximately .02, and when r_t is .90 the maxi- 
mum discrepancy is approximately .04, both in the direction of overestima- 
tion. In many situations we can control to a large extent the point of dichot- 
omy and can see to it that p and p' are close to .5. When they are not, it 
would be best to use one of the graphic methods mentioned next. 

Graphic Estimates of Tetrachoric r. When a large number of tetrachoric r's 
must be computed, considerable saving of labor is provided by the Thurstone 
computing diagrams.² These are highly recommended since they yield two- 
place accuracy with little effort after the fourfold table is reduced to the status 
of proportions throughout, as in Table 13.7. From the computing diagrams, 
r_t for the data in Table 13.7 is estimated to be +.79. The correlation of the 
two questions of Table 13.6 is estimated as +.34, which checks with previous 
estimates. Another graphic procedure has been published by Hayes.³ 

The Standard Error of a Tetrachoric r. The tetrachoric r is less reliable 
than the Pearson r, being at least 50 per cent more variable. r_t is most 
reliable (1) when N is large, as is true of all statistics; (2) when r is large, as is 
true of other r's; but also (3) when the divisions into two categories are close 
to the medians. The complete formula for estimating σ_rt is too long to be 
practical and so it will not be given here. But when r_t = .0, the formula is 
much simpler and reads 

$$ \sigma_{r_t} = \frac{\sqrt{p \, p' \, q \, q'}}{y \, y' \sqrt{N}} \qquad (13.18) $$
(Standard error of a zero tetrachoric r) 

where the symbols mean the same as in formula (13.15) or in Table 13.6. 
For the 930 cases in the problem of Table 13.6, 

$$ \sigma_{r_t} = \frac{\sqrt{(.582)(.602)(.418)(.398)}}{(.3905)(.3858)\sqrt{930}} = .053 $$


1 Bouvier, E. A., Perry, N. C., Michael, W. B., and Hertzka, A. F. A study of the error 
in the cosine-pi approximation to the tetrachoric coefficient of correlation. Educ. psychol. 
Measmt., 1954, 14, 690-699. 

2 Chesire, L., Saffir, M., and Thurstone, L. L. Computing Diagrams for the Tetrachoric 
Correlation Coefficient. Chicago: University of Chicago Bookstore, 1938. 

3 Hayes, S. P. Diagrams for computing tetrachoric correlation coefficients from per- 
centage differences. Psychometrika, 1946, 11, 163-172. 


Since the obtained r_t, .34, is more than 2.6 times this standard error, we can 
be quite positive that the two qualities represented by the two questions are 
really correlated in the population. 
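
The arithmetic of formula (13.18) can be checked with a short Python sketch. The function name is hypothetical, and the normal-curve ordinates y and y' are obtained from the standard library's unit normal distribution rather than from the appendix tables, which is an assumption about how the tabled values were produced.

```python
import math
from statistics import NormalDist


def se_zero_tetrachoric(p, p_prime, n):
    """Standard error of a tetrachoric r when the population r_t is zero."""
    unit_normal = NormalDist()
    q = 1.0 - p
    q_prime = 1.0 - p_prime
    # Ordinates of the unit normal curve at the two points of division
    y = unit_normal.pdf(unit_normal.inv_cdf(p))
    y_prime = unit_normal.pdf(unit_normal.inv_cdf(p_prime))
    return math.sqrt(p * p_prime * q * q_prime) / (y * y_prime * math.sqrt(n))


# Marginal proportions and N of Table 13.6
se = se_zero_tetrachoric(0.582, 0.602, 930)
print(round(se, 3))        # standard error
print(round(0.34 / se, 1)) # obtained r_t in standard-error units
```

The ratio printed on the last line is the quantity compared with 2.6 in the text.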

To attain the same degree of reliability in a tetrachoric r as in a Pearson r, 
one needs more than twice the number of cases in a sample. For very 
dependable results, when r_t is to be computed, it is recommended that N be 
at least 200, and preferably 300. In smaller samples than these, even less 
than N = 100, a tetrachoric r can be used to test the null hypothesis, but it 
cannot be depended upon to give very accurate estimates of the size of corre- 
lation unless r_t is very large. 

Reducing Distributions in Class Intervals to Fourfold Tables. Data need 
not be obtained in two categories each way in order to apply the tetrachoric 
solution for r. Any scatter diagram, in fact, can be reduced to two groups 
each way by making arbitrary divisions. Such a division should be made as 
nearly as possible at or near the median in each distribution. Table 13.7 
shows a scatter diagram in which reduction to a fourfold table would be 
highly desirable. A Pearson r computed with so few class intervals each way 
would be highly influenced by errors of grouping. The very large number of 
cases renders the reduction in reliability of r by computing r_t of small impor- 
tance. The divisions suggested in Table 13.7 come between the B's and C's 
for distribution of school marks and at an IQ between 89 and 90 for intelli- 
gence rating. The revised correlation distribution is seen in Table 13.7. 

Some Applications of r_t to Be Avoided. Many of the limitations of the 
tetrachoric r have already been pointed out. There are others that should 
not go unnoticed. It is well to avoid estimating r_t when the split in either X 
or Y is very one-sided (for example, a 95-5, or even a 90-10, division of the 
cases). The standard error is much larger in such situations as these.¹ 

1 For aids in estimating σ_rt, see Guilford, J. P., and Lyon, T. C. On determining the 
reliability and significance of a tetrachoric coefficient of correlation. Psychometrika, 1942, 
7, 243-249; also Hayes, S. P. Tables of the standard error of tetrachoric correlation 
coefficient. Psychometrika, 1943, 8, 193-203. 


TABLE 13.7. THE REDUCTION OF A SCATTER DIAGRAM TO A FOURFOLD TABLE 
PREPARATORY TO THE COMPUTATION OF A TETRACHORIC COEFFICIENT 
OF CORRELATION* 

[Scatter diagram of IQ (rows, from "120 and above" downward) against mark in 
schoolwork (columns); the cell frequencies are not recoverable.] 

                   In terms of frequencies         In terms of proportions 
IQ               C or below   A or B   Total     C or below   A or B   Total 
90 or above         273         296      569        .269        .291     .560 
Below 90            423          24      447        .416        .024     .440 
Total               696         320     1016        .685        .315    1.000 

* Adapted from Cobb, M. V. The limits set to educational achievement by limited intelligence. 
J. educ. Psychol., 1922, 18, 449. By permission of the publisher. 


Especially to be avoided is an attempt to estimate r_t when there is a zero in 
only one cell. Table 13.8, A and B, illustrates two such examples. If r_t were 
computed for problem A, it would equal -1.0 (the zero is in cell a); if com- 
puted for problem B, r_t would equal +1.0. This is in spite of the fact that 
about one-fourth of the cases belie the perfect correlations apparent by com- 
putation (90 cases out of 400 in A are out of line with the finding and 80 cases 
in B). 


TABLE 13.8. ILLUSTRATIONS OF SOME UNUSUAL FOURFOLD CONTINGENCY TABLES 
IN WHICH COMPUTATION OF A TETRACHORIC r IS QUESTIONABLE 

A 
      0     200  |  200 
    110      90  |  200 
    110     290  |  400 

[Tables B and C are not recoverable.] 

These examples are perhaps somewhat rare, but zero frequencies are cer- 
tainly not unheard of. Even scatters like that in C would probably give a 
false estimate of correlation. There is no zero, but there is an exceptionally 
small frequency (15) among much larger ones. In all three fourfold tables 
the distributions are such as to suggest nonlinear regressions if these broad 
categories were broken down into finer groupings. If the assumption of 
linearity is not satisfied, r_t may well give a biased estimate of correlation. 
Such distributions as those in Table 13.8 are not proof of nonlinear regression, 
but they strongly suggest it. In general, a distribution in such a table should 
appear to be rather symmetrical around one diagonal axis or the other, 
depending upon whether the correlation is negative or positive. This holds 
true if the proportion p is somewhat near the proportion p', but if they differ 
too much, asymmetry cannot be taken necessarily to mean curved regression. 


THE PHI COEFFICIENT 


When the two distributions correlated are really dichotomous, when the 
two classes are separated by a real gap between them and previous correla- 
tional methods do not apply, we may resort to the phi coefficient.¹ This was 
designed for so-called point distributions, which implies that the two classes 
have two point values or merely represent some unmeasurable attribute. 
Such a case would be illustrated by eye color, sex membership, “living versus 
dead," and the like. The method can be applied, however, to data that are 
measurable on a continuous variable if we make certain allowances for that 
fact. It is a close relative of chi square, which is applicable to a wide variety 
of situations. 

The Computation of Phi. To illustrate the use of phi (φ), we shall use 
again some data that were previously employed with chi square (see Table 11.1). 
They are repeated here as we need them, in proportion form, in Table 13.9. 


TABLE 13.9. A TABLE TO ILLUSTRATE THE CORRELATION OF ATTRIBUTES 

                  Normal       Feebleminded      Both 
Married          .269 (α)        .204 (β)       .473 (p) 
Not married      .231 (γ)        .296 (δ)       .527 (q) 
Both             .500 (p')       .500 (q')     1.000 


The formula for the phi coefficient is 

$$ \phi = \frac{\alpha\delta - \beta\gamma}{\sqrt{p \, q \, p' \, q'}} \qquad (13.19) $$
(The phi correlation coefficient) 

where the symbols correspond to the labeled cells in Table 13.9.¹ The solu- 
tion of φ for this table is 

$$ \phi = \frac{(.269)(.296) - (.204)(.231)}{\sqrt{(.473)(.527)(.5)(.5)}} = .1302, \text{ or } .13 $$

1 Also known as the Yule φ or sometimes as the Yule-Boas φ. See Yule, G. U. On 
the methods of measuring the association between two attributes. J. Roy. Stat. Soc., 
1912, 76, 576-642. 

The Relation of Phi to Chi Square. Phi is related to chi square from a 
2 X 2 table by the very simple equation 

$$ \chi^2 = N\phi^2 \qquad (13.20) $$
(Chi square as a function of phi) 

and phi is derived from chi square by the equation 

$$ \phi = \sqrt{\frac{\chi^2}{N}} \qquad (13.21) $$
(Phi as a function of chi square) 

By formula (13.20), for the data of Table 11.1, 

$$ \chi^2 = (412)(.016952) = 6.98 $$

This checks with the solution of chi square by other methods (see Chap. 11). 

Since phi can be derived directly from chi square, when the latter is applied 
to a 2 X 2 table, any of the formulas for chi square given in Chap. 11 will 
apply to its computation. Formula (11.5), especially, which is very similar 
to formula (13.19) above, is probably most convenient. Applied directly 
to the computing of phi, it becomes 

$$ \phi = \frac{ad - bc}{\sqrt{(a + b)(a + c)(b + d)(c + d)}} \qquad (13.22) $$
(Phi computed from frequencies) 

On a computing machine, it is more convenient to compute φ², which means 
squaring the numerator and omitting the step of taking the square root of the 
denominator. From φ² one can compute either chi square or phi in a single 
additional operation. 

The Special Case of Phi When One Distribution Is Evenly Divided. When 
one of the distributions, let us say the one for which we use p' and q' as total 
proportions, is evenly divided so that p' = q' = .50, the solution of φ is con- 
siderably simplified. The formula reads 

$$ \phi = \frac{\alpha - \beta}{\sqrt{pq}} \qquad (13.23) $$
(Phi from evenly divided proportions) 

Applied to the data on marital status, 

$$ \phi = \frac{.269 - .204}{\sqrt{(.473)(.527)}} = .130 $$

This particular case is useful in many an experimental situation where two 
separated groups are selected with equal numbers of cases. There is some 
question here, of course, as to how well the samples chosen represent the 
larger population from which they were obtained, and so interpretations 
should be stated with this knowledge in mind. 

1 For a derivation of formula (13.19), see Appendix A. 
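
The phi formulas of this section can be compared numerically with a small Python sketch. The function names are hypothetical; the proportions are those of Table 13.9 and N = 412 is the number of cases used with formula (13.20).

```python
import math


def phi_from_proportions(alpha, beta, gamma, delta):
    """Phi from the four cell proportions, after formula (13.19)."""
    p, q = alpha + beta, gamma + delta              # row totals
    p_prime, q_prime = alpha + gamma, beta + delta  # column totals
    return (alpha * delta - beta * gamma) / math.sqrt(p * q * p_prime * q_prime)


def phi_from_frequencies(a, b, c, d):
    """Phi from the four cell frequencies, after formula (13.22)."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


def phi_even_split(alpha, beta):
    """Phi when the column totals are both .50, after formula (13.23)."""
    p = alpha + beta
    return (alpha - beta) / math.sqrt(p * (1.0 - p))


def chi_square_from_phi(phi, n):
    """Chi square for a 2 x 2 table, after formula (13.20)."""
    return n * phi ** 2


phi = phi_from_proportions(0.269, 0.204, 0.231, 0.296)
print(round(phi, 4))                            # general formula
print(round(phi_even_split(0.269, 0.204), 4))   # special case, same data
print(round(chi_square_from_phi(phi, 412), 2))  # corresponding chi square
```

Because the column totals in Table 13.9 are both .50, the general formula and the special-case formula should agree on these data.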

The Reliability and Significance of Phi. The formula for the estimation 
of the standard error of phi involves such laborious computations that it is 
impractical for general use. It will not be given here. A test of the null 
hypothesis, fortunately, can be made through phi's relationship to chi square. 
If χ² is significant in a fourfold table, the corresponding φ is significant. The 
procedure, then, is to derive the corresponding χ² from the obtained φ by 
means of formula (13.20), then examine Table E to find whether for 1 degree 
of freedom the required standard of significance is met. In the marital 
problem, we find that a chi square of 6.98 is significant beyond the .01 level; 
therefore the obtained phi of .13 is likewise significant.¹ 
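
In place of a table lookup, the probability attached to a chi square with 1 degree of freedom can be computed directly. The sketch below relies on the fact that such a chi square is the square of a unit normal deviate, so its upper-tail probability is a two-tailed normal probability; the function name is hypothetical.

```python
import math


def chi_square_p_value_1df(chi_square):
    """Upper-tail probability of chi square with 1 degree of freedom."""
    # P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi_square / 2.0))


p_value = chi_square_p_value_1df(6.98)
print(p_value < 0.01)  # significance beyond the .01 level
```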

An Evaluation of the Phi Coefficient. Phi is actually a product-moment 
coefficient of correlation. Its formula is a variation of Pearson's fundamental 
equation, r = Σxy/Nσ_xσ_y. The similarity may be seen to some degree, at 
least, if we break the denominator of formula (13.19) into two components, 
√(pq) and √(p'q'). These are the standard deviations of the two point dis- 
tributions, in Y and X. If we give numerical values of +1 and 0 to the two 
categories in X and in Y, and if we carry through the computation of a Pear- 
son r in a scatter diagram of four cells, we arrive at a correlation coefficient 
equal to φ. 
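
The equivalence of phi and a Pearson r on categories scored 1 and 0 can be verified by brute force. In this Python sketch the helper names and the particular assignment of scores to cells are assumptions for illustration; the frequencies are those of Table 13.6.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations from the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)


def expand_fourfold(a, b, c, d):
    """Turn cell frequencies into paired 1/0 scores (X first, then Y)."""
    # a: X=1, Y=1   b: X=0, Y=1   c: X=1, Y=0   d: X=0, Y=0
    xs = [1] * a + [0] * b + [1] * c + [0] * d
    ys = [1] * a + [1] * b + [0] * c + [0] * d
    return xs, ys


def phi_from_frequencies(a, b, c, d):
    """Phi computed directly from the four frequencies."""
    return (a * d - b * c) / math.sqrt((a + b) * (a + c) * (b + d) * (c + d))


xs, ys = expand_fourfold(374, 167, 186, 203)
print(round(pearson_r(xs, ys), 4))
print(round(phi_from_frequencies(374, 167, 186, 203), 4))
```

The two printed values should be identical apart from rounding error in the arithmetic.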

Limitations to the Size of Phi. While φ can vary from -1.0 to +1.0, only 
under certain conditions can φ be as large as either of these extremes, even 
though a tetrachoric r, if computed for the same data, would yield an r_t equal 
to 1. This is probably its greatest weakness, but in certain practical situ- 
ations it is a realistic feature. The reason is that a 2 X 2 table places serious 
restrictions upon φ that do not affect r_t. The general principle is that φ can 
be as great as 1 only when p = p' or p = q' (and, of course, q = q' or q = p'). 

To illustrate these restrictions, we may take a few special cases in which 
p = .5 but p' is allowed to vary. Such instances are pictured in Table 13.10. 
With an even division of the cases in the two categories in Y, only with an 
even division also in X is it possible to have a perfect correlation, as shown in 
contingency tables A and B. With a division of 75-25 in variable X, the 
maximum φ would be .58 (contingency table C) and with a 90-10 division, the 
maximum φ would be .33. In contingency table E the division in X is again 
75-25 but there is departure from maximal relationship. The obtained φ of 
.35 may be interpreted for size in the light of the maximal φ possible with the 
particular combination of marginal totals, if we are interested in the under- 
lying strength of relationship between X and Y. If we are interested in mak- 
ing predictions from categories to other categories, however, the obtained φ is 
a more realistic figure. The problems of prediction come in the chapters to 
follow. 


TABLE 13.10. SOME FOURFOLD CONTINGENCY TABLES ILLUSTRATING THE DEPENDENCE 
OF THE SIZE OF A PHI COEFFICIENT UPON THE MARGINAL TOTALS 

[The cell entries of the five tables are not recoverable.] 

     A             B             C            D            E 
 φ = +1.0      φ = -1.0      φ = .58      φ = .33      φ = .35 


1 According to McNemar, we may use 1/√N as the standard error of φ (when φ = 0) 
if N is not small (see Psychological Statistics. 2d ed. New York: Wiley, 1954. P. 203). 

Determination of a Maximal Phi Coefficient. Because of the increasing 
importance of the phi coefficient, particularly in connection with test-item 
intercorrelations, it is desirable for the purposes of orientation to have some 
conception of the drastic limitations to the size of phi. In general, the maxi- 
mal φ for any combination of marginal proportions can be calculated by 
means of the formula 

$$ \phi_{\max} = \sqrt{\left( \frac{p_j}{q_j} \right) \left( \frac{q_i}{p_i} \right)} \quad (\text{where } p_i \geq p_j) \qquad (13.24) $$
(Maximal value of φ with different combinations of p_i and p_j) 

where p_i = largest marginal proportion in a 2 X 2 contingency table and 
p_j = the corresponding marginal proportion in the other variable. Wher- 
ever p_i = p_j, the maximal φ equals 1.0. To apply formula (13.24) to 
Table 13.10, C and E, 

$$ \phi_{\max} = \sqrt{\left( \frac{.50}{.50} \right) \left( \frac{.25}{.75} \right)} = .58 $$

Computations with formula (13.24) are greatly facilitated by use of Table G 
where values of √(p/q) and √(q/p) are given. Formula (13.24) can be broken 
into the two components √(p_j/q_j) and √(q_i/p_i) whose product gives the maxi- 
mal phi. 
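
A minimal Python sketch of formula (13.24) follows. The function name is hypothetical, and the swap of arguments is added only so that the larger proportion is always treated as p_i; both proportions are assumed to lie strictly between 0 and 1.

```python
import math


def max_phi(p_i, p_j):
    """Largest phi attainable for two corresponding marginal proportions."""
    if p_i < p_j:                 # make p_i the larger proportion
        p_i, p_j = p_j, p_i
    q_i, q_j = 1.0 - p_i, 1.0 - p_j
    return math.sqrt((p_j / q_j) * (q_i / p_i))


print(round(max_phi(0.75, 0.50), 2))  # 75-25 division against an even one
print(round(max_phi(0.90, 0.50), 2))  # 90-10 division against an even one

# An obtained phi may be expressed relative to its ceiling
print(round(0.35 / max_phi(0.75, 0.50), 2))
```

The last line shows one way of relating an obtained φ to the maximal φ for the same marginal totals; whether that ratio is a useful index is a matter of interpretation rather than something the formula itself settles.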

Figure 13.4 provides a graphic solution to the same equation for values of 
p_i from .50 through .98 and p_j throughout the same range. These ranges will 
take care of many of the situations in which φ would ordinarily be com- 
puted. It is recommended that the maximal φ that suits any given situa- 
tion be considered when interpreting an obtained φ as representing a 
strength of the intrinsic relationship between two variables. The word 
intrinsic is stressed here, because the actual size of φ indicates the degree of 
practical, predictive value of the relationship. Predictive value is actually 
restricted by inequality of p_i and p_j. 

[Figure: maximum phi coefficient possible (vertical axis) plotted against the 
smaller of the two proportions, p_j (horizontal axis).] 
FIG. 13.4. Maximal phi coefficients for different combinations of proportions of cases in 
corresponding categories in X and Y when both have the larger frequencies. 


The Coefficient of Contingency. It has been shown how a φ coefficient can 
be derived from chi square. Phi squared, for a 2 X 2 table, is equal to chi 
square divided by N. For this reason φ² has been called the mean-square 
contingency. By analogy, we might call φ the mean contingency, although 
this name is not used for it. When there are more than two classes in either 
X or Y, or in both, however, there is another correlation index, called the 
coefficient of contingency, and it is designated by the letter C. The formula 
for deriving it from chi square is 

$$ C = \sqrt{\frac{\chi^2}{N + \chi^2}} \qquad (13.25) $$
(Coefficient of contingency computed from chi square) 

Like φ, the coefficient of contingency is restricted in size, but not to the 
same extent. When the number of categories is large (at least five each way), 
C approaches the Pearson r in size. If the categorized data represent con- 
tinuous, normal distributions, if N is large, and if class intervals are of approxi- 
mately equal size, the correction procedures applied to the Pearson r, described 
later in this chapter (Table 13.15), may be applied to the C coefficient. If 
the data are in genuine categories (point distributions, or nearly so), it is best 
to interpret C as it is. The maximum C for each given number of categories 
each way is shown in Table 13.11. 


TABLE 13.11. MAXIMAL VALUES ATTAINABLE FOR A COEFFICIENT OF CONTINGENCY 
WITH DIFFERENT NUMBERS OF CATEGORIES IN BOTH X AND Y VARIABLES 


The standard error of C involves so much computation that it is hardly 
worth the effort to estimate it. A formula for this is given by Kelley.¹ For 
testing the hypothesis of zero correlation in a population, the chi square from 
which C is derived will serve very well. 
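
Formula (13.25) is easily sketched in Python. The function names are hypothetical. The ceiling used in the second function, the square root of (k - 1)/k for a table with k categories each way, is the standard result for square tables and is stated here as an assumption, since the entries of Table 13.11 are not reproduced above.

```python
import math


def contingency_coefficient(chi_square, n):
    """Coefficient of contingency C, after formula (13.25)."""
    return math.sqrt(chi_square / (n + chi_square))


def max_contingency(k):
    """Assumed ceiling of C for a k x k table: sqrt((k - 1) / k)."""
    return math.sqrt((k - 1) / k)


# Chi square and N from the marital-status example
print(round(contingency_coefficient(6.98, 412), 3))

# Ceiling of C for tables with 2 to 5 categories each way
for k in range(2, 6):
    print(k, round(max_contingency(k), 3))
```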


PARTIAL CORRELATION 


The Meaning of Partial Correlation. A partial correlation between two 
things is one that nullifies the effects of a third variable (or a number of other 
variables) upon both the variables being correlated. The correlation between 
height and weight of boys in a group where age is permitted to vary would be 
higher than the correlation between height and weight for a group at constant 
age. The reason is obvious. Because boys are older, they are both heavier 
and taller. Age is a factor that enhances the strength of correspondence 
between height and weight. With age held constant, the correlation would 
still be positive and significant, because at any age taller boys tend to be 
heavier. 

If we wanted to know the correlation between height and weight with the 
influences of age ruled out, we could, of course, keep samples separated and 
compute r at each age level. But the partial-correlation technique enables 
us to accomplish the same result without so fractionating data into homogene- 
ous groups. When only one variable is held constant, we speak of a first-order 
partial correlation. The general formula is 

$$ r_{12.3} = \frac{r_{12} - r_{13} r_{23}}{\sqrt{(1 - r_{13}^2)(1 - r_{23}^2)}} \qquad (13.26) $$
(First-order partial coefficient of correlation) 


In a group of boys aged 12 to 19, the correlation between height and weight 
(r_12) was found to be .78. Between height and age, r_13 = .52. Between 
weight and age, r_23 = .54. The partial correlation is therefore 

$$ r_{12.3} = \frac{.78 - (.52)(.54)}{\sqrt{(1 - .52^2)(1 - .54^2)}} = .69 $$

With the influences of age upon both height and weight ruled out or nullified, 
then, the correlation between the two is .69. 

1 Kelley, T. L. Statistical Method. New York: Macmillan, 1923. P. 269. 

As another example with three variables, the correlation between strength 
and height (r_14) in this same group was .58. The correlation between strength 
and weight (r_24) was .72. Although there is a significantly high correlation 
between strength and height, we wonder whether this is not due to the factor 
of weight-going-with-height rather than to height itself. Accordingly we 
hold weight constant and ask what the correlation would be then. Will boys 
of the same weight show any dependence of strength upon height? The 
correlation is given by 

$$ r_{14.2} = \frac{.58 - (.72)(.78)}{\sqrt{(1 - .72^2)(1 - .78^2)}} = .042 $$


By partialing out weight, it is found that the correlation between height and 
strength nearly vanishes. We conclude, therefore, that height as such has no 
bearing upon strength, but only by virtue of its association with weight does 
it show any correlation at all. 
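
Both worked examples can be reproduced with a short Python sketch of formula (13.26). The function and argument names are hypothetical.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Height and weight with age held constant
print(round(partial_r(0.78, 0.52, 0.54), 2))

# Height and strength with weight held constant
print(round(partial_r(0.58, 0.78, 0.72), 3))
```

In the second call the two correlations involving weight (.78 with height, .72 with strength) take the place of r_13 and r_23 in the formula.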

Second-order Partials. When we hold two variables constant at the same 
time, we call the coefficient a second-order partial r. The general formula is 

$$ r_{12.34} = \frac{r_{12.3} - r_{14.3} \, r_{24.3}}{\sqrt{(1 - r_{14.3}^2)(1 - r_{24.3}^2)}} \qquad (13.27) $$
(Second-order partial coefficient of correlation) 

In using this formula, the subscripts will have to be modified to suit the choice 
of variables. Here we are assuming that we want to know the correlation 
that would occur between X_1 and X_2 with the effects of X_3 and X_4 eliminated 
from both. It is clear that this formula requires the solution of three first- 
order partials previously. 

As an example of this partial, we may cite the correlation between strength 
and age with height and weight held constant. This would mean that if a 
group of boys having the same height and weight were taken, would older 
boys be stronger? The raw correlation between age and strength was .29. 
The second-order partial also turned out to be .29. This means that it seem- 
ingly makes no difference whether we allow height and weight to vary or 
whether we do not; the relation between age and strength is the same within 
the range examined. 
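
The build-up of a second-order partial from three first-order partials can be sketched as follows. The function names are hypothetical, and the six correlations used in the demonstration are invented for illustration; they are not the data of the height, weight, age, and strength study.

```python
import math


def first_order(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def second_order(r12, r13, r14, r23, r24, r34):
    """r_12.34, built from three first-order partials with X3 held constant."""
    r12_3 = first_order(r12, r13, r23)
    r14_3 = first_order(r14, r13, r34)
    r24_3 = first_order(r24, r23, r34)
    return first_order(r12_3, r14_3, r24_3)


# Hypothetical intercorrelations among four variables
r12, r13, r14, r23, r24, r34 = 0.60, 0.50, 0.40, 0.30, 0.35, 0.20

removing_3_then_4 = second_order(r12, r13, r14, r23, r24, r34)
# Exchanging the roles of X3 and X4 removes X4 first, then X3
removing_4_then_3 = second_order(r12, r14, r13, r24, r23, r34)
print(round(removing_3_then_4, 4), round(removing_4_then_3, 4))
```

The order in which the two variables are partialed out should make no difference to the result, which gives a convenient check on the arithmetic.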

Some Suggestions concerning Partial Correlation. Needless to say, unless 
the assumptions necessary for computing the Pearson r's involved are ful- 
filled, there is little excuse for using them as the basis for computing partial 
correlations. There are actually few occasions in psychology and education 
when a partial r is called for. The partialing out of such things as chronologi- 
cal age is perhaps the most common instance in which it is a useful device. It 
is not to be recommended as a lazy man's substitute for experimental control 
and fractionation of data. The newer processes of analysis of variance and 
tests of significance of statistics from small samples make experimental 
planning seem more important and the treatment of results more satisfactory 
without resort to partial correlations. 

Reliability and Significance of an Obtained Partial r. The standard error 
of a partial coefficient of correlation is the same as for a Pearson r except that 
the number of degrees of freedom should be used in the denominator. The 
general formula is 

$$ \sigma_{r_{12.34 \ldots m}} = \frac{1 - r_{12.34 \ldots m}^2}{\sqrt{N - m}} \qquad (13.28) $$
(Standard error of a partial r) 

where m is the number of variables involved. 


SOME SPECIAL PROBLEMS IN CORRELATION 


The Relativity of All Coefficients of Correlation. It is apparent that the 
size of the coefficient of correlation depends to some extent upon the method 
of computing it. What is more important, coefficients computed between 
the same two variables by the same procedure will vary not only from sample 
to sample but from population to population. If there are any really absolute 
correlations in the universe, all variables except the two being held constant, 
those correlations are probably either zero or 1, or close to either of those 
values. With contaminating variables left in, the correlations are usually 
between zero and 1. It is therefore really meaningless to speak of the correla- 
tion between intelligence and character (if it is assumed even that we know 
what those variables are and have properly measured them) or even between 
age and height or any other common variables without at the same time 
specifying what kind of sample we measured. 

A coefficient is always relative to the kind of population sampled and to the 
manner in which the measurements were made. In reporting coefficients of 
correlation, any writer should be very careful to state all the pertinent factors 
that bear upon the size of his obtained correlation coefficients, and any reader 
should accept interpretations only when the significant circumstances are 
kept in mind. A few of the more common sources of variations of size of r 
will be reviewed briefly in what follows. 

The Variability in the Correlated Variables. The size of r is very much 
dependent upon the range of ability or, in more general terms, the variability 
of measured values, in the correlated sample. The greater the variability, 
the higher will be the correlation, everything else being equal. It should be 
easier to predict individual differences in scholarship in a class with IQ's 
ranging from 50 to 150 than in a class where the range is restricted to 90 to 
110. If the restriction were to a range of zero (all IQ's being equal) there 
should be no correlation whatever—the limiting case, in which, of course, no 
r could be computed at all. Often we know the correlation between some 
predictive index, such as aptitude-test score and scholarship or some voca- 
tional criterion of success as derived from one group of individuals, but we 
shall be applying the same index to other groups with different ranges of 
ability, larger or smaller. What will be the effectiveness of predictions in the 
new groups? 

In the selection of personnel by means of tests, as during World War II, 
research on selective instruments was constantly beset with this very practical 
problem. New tests were put into use in the selection of personnel, and they 
correlated substantially with tests that were used in selection. The result 
was that the men who went into training represented only a higher segment 
of the population from which selection was to be made by the new tests. The 
validity of a test could be estimated only for this higher segment of restricted 
range. And yet, it was the validity in the total population that it was impor- 
tant to know, for it is that validity which indicates the full selective value of 
the test. The coefficient of validity in the restricted group is almost invaria- 
bly smaller than what it would be in an unrestricted group. 

In a research program such as that on the selection and classification of 
aviation trainees during World War II, the problem of restriction of range 
became quite important. Near the end of the war, about 50 per cent of the 
applicants for aircrew training failed to pass the general qualifying examina- 
tion, and of these as many as 75 per cent failed to qualify for a particular type 
of training. Furthermore, it was desired to correlate tests with advanced- 
training achievement criteria and even combat performance after many more 
had been eliminated at various stages of training. The proportions of the 
original applicants who survived to these stages were rather small. Restric- 
tion of range was very great. 

Karl Pearson, many years ago, provided a solution that applies under cer- 
tain conditions. The variables being studied must be normally distributed 
in the population and we must know certain parameters or estimates of them 
in order to solve the problem in any particular situation. We need to know 
the relation of the dispersions in the restricted and unrestricted populations, 
either in terms of the variable on which selection occurred or on the basis of 
some variable correlated with it. We also need to know the correlation in 
the restricted population between the variable we wish to validate and the 
criterion of success in training or on the job. There are three formulas of 
practical use in this problem, each of which recognizes the availability of 
certain information and the need for validation of a certain kind of variable. 

Case I. Restriction is produced by selection on the basis of $X_1$, and 
there is knowledge of standard deviations in $X_1$ for both restricted and 
unrestricted groups. The correlation $r_{12}$ is known in the restricted group. 
The correlation $R_{12}$ for the unrestricted group is estimated by 

$$R_{12} = \frac{r_{12}\,(\Sigma_1/\sigma_1)}{\sqrt{1 - r_{12}^2 + r_{12}^2\,(\Sigma_1^2/\sigma_1^2)}} \tag{13.29}$$ 
(Correlation corrected for restriction of range, Case I) 

where $r_{12}$ = correlation between $X_1$ and $X_2$ in the restricted group 
$\sigma_1$ = standard deviation in measurements on $X_1$ in the restricted group 
$\Sigma_1$ = standard deviation in the same variable in the unrestricted group 
In this and in the next two formulas, capital letters stand for values pertaining 
to the unrestricted population and lower-case letters refer to the restricted 
population. 

The application of this formula is as follows: Suppose that the selection 
test ($X_1$) correlated .30 with the training criterion in the group selected on the 
basis of the test. The standard deviation in the unrestricted group ($\Sigma_1$) was 
20 and that in the restricted group ($\sigma_1$) was 10. The solution is 

$$R_{12} = \frac{.30\,(20/10)}{\sqrt{1 - .09 + (.09)(20^2/10^2)}} = .53$$ 
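The Case I arithmetic can be checked with a few lines of Python. This is only a sketch: the function name `correct_case1` and its parameter names are illustrative choices, not part of the text, and the formula coded is the Case I correction as stated in formula (13.29).

```python
import math


def correct_case1(r12, sd_restricted, sd_unrestricted):
    """Estimate the unrestricted correlation R12 (Case I).

    r12             -- correlation observed in the restricted group
    sd_restricted   -- standard deviation of X1 in the restricted group
    sd_unrestricted -- standard deviation of X1 in the unrestricted group
    """
    ratio = sd_unrestricted / sd_restricted  # Sigma_1 / sigma_1
    return (r12 * ratio) / math.sqrt(1 - r12 ** 2 + r12 ** 2 * ratio ** 2)


# The worked example: r = .30, standard deviations 20 (unrestricted) and 10 (restricted).
estimate = correct_case1(0.30, sd_restricted=10, sd_unrestricted=20)
print(round(estimate, 2))
```

When the two standard deviations are equal, the function returns the observed r unchanged, which is a quick check that no correction is applied when no restriction has occurred.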


Case II. Restriction is produced by selection on the basis of $X_1$, and 
there is knowledge of standard deviations for $X_2$ in both restricted and 
unrestricted samples and of the correlation $r_{12}$ in the restricted group. The 
correlation in the unrestricted group is estimated by 

$$R_{12} = \sqrt{1 - \frac{\sigma_2^2}{\Sigma_2^2}\,(1 - r_{12}^2)} \tag{13.30}$$ 
(Correlation corrected for restriction of range, Case II) 

where $\sigma_2$ = standard deviation on $X_2$ in the restricted group and $\Sigma_2$ = 
standard deviation on $X_2$ in the unrestricted group. This formula would apply 
when we correlate two selection tests, when we have selected on the basis of 
one test ($X_1$) but know the change of range from knowledge of variances in 
the other test ($X_2$). One or both of the “tests” might be a composite score 
derived from a combination of several tests. An example of this from 
aviation psychology was the correlation of an experimental test with the pilot 
stanine (composite aptitude score) when selection had been made on the 
basis of the stanine and it was more convenient to use the change in dispersion 
on the test. If we assume the same restricted correlation ($r_{12}$ = .30) as in 
the previous illustration, also that the restricted and unrestricted standard 
deviations are 10 and 20, respectively, 

$$R_{12} = \sqrt{1 - \frac{10^2}{20^2}\,(1 - .30^2)} = .88$$ 
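A short Python sketch of the Case II computation follows. It assumes the formula has the form R = sqrt(1 − (σ²/Σ²)(1 − r²)), which reproduces the value of .88 in the illustration; the function name `correct_case2` and its parameter names are illustrative only.

```python
import math


def correct_case2(r12, sd2_restricted, sd2_unrestricted):
    """Estimate the unrestricted correlation R12 (Case II).

    The standard deviations refer to X2, the variable that was NOT the
    basis of selection. The square root yields a positive value, so the
    sign of the observed r must be carried over by the user.
    """
    variance_ratio = (sd2_restricted / sd2_unrestricted) ** 2
    return math.sqrt(1 - variance_ratio * (1 - r12 ** 2))


# The illustration: r = .30, restricted SD 10, unrestricted SD 20.
print(round(correct_case2(0.30, 10, 20), 2))
```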


Case III. Restriction is produced by selection on variable $X_3$, on which 
variable the restricted and unrestricted standard deviations are known. We 
wish to estimate the unrestricted correlation $R_{12}$, when we also know $r_{12}$, $r_{13}$, 
and $r_{23}$. The formula is 

$$R_{12} = \frac{r_{12} + r_{13}r_{23}\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)}{\sqrt{\left[1 + r_{13}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]\left[1 + r_{23}^2\left(\dfrac{\Sigma_3^2}{\sigma_3^2} - 1\right)\right]}} \tag{13.31}$$ 
(Correlation corrected for restriction of range, Case III) 

where the symbols are defined similarly to those in formulas (13.29) and 
(13.30). This formula would apply to the correlation of a new, experimental 
test $X_1$ with a practical criterion $X_2$, when selection had been made on the 
basis of a third variable (pilot stanine, for example) $X_3$. 
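The Case III correction can be sketched in Python as well. The code assumes the usual three-variable form of the correction, with the gain term Σ₃²/σ₃² − 1 appearing in the numerator and in both factors of the denominator; the function name and parameter names are illustrative. A useful consistency check is that when the selection variable is the test itself (so that r13 = 1 and r23 = r12), the result agrees with the Case I correction.

```python
import math


def correct_case3(r12, r13, r23, sd3_restricted, sd3_unrestricted):
    """Estimate the unrestricted R12 when selection was made on X3 (Case III)."""
    gain = (sd3_unrestricted / sd3_restricted) ** 2 - 1  # Sigma_3^2 / sigma_3^2 - 1
    numerator = r12 + r13 * r23 * gain
    denominator = math.sqrt((1 + r13 ** 2 * gain) * (1 + r23 ** 2 * gain))
    return numerator / denominator


# Hypothetical values: the test correlates .25 with the criterion in the
# selected group, .40 with the stanine, and the stanine .20 with the criterion.
print(round(correct_case3(0.25, 0.40, 0.20, 10, 20), 3))

# Degenerate check: selection on X1 itself reproduces the Case I result.
print(round(correct_case3(0.30, 1.0, 0.30, 10, 20), 2))
```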

The reader may have been somewhat surprised at the rather radical change 
in correlation that occurred as we corrected for restriction of range in the two 
hypothetical problems above. To show that these changes are not 
unreasonable, some data will be cited from the AAF results.¹ An experimental group 
of more than a thousand pilots had been permitted to enter training without 
any selection whatever on the basis of either qualifying or classification tests. 
We know, then, how the pilot stanine and certain classification tests corre- 
lated with the graduation-elimination criterion at the end of training. We 
can also arbitrarily pull out a high segment of the total sample and within 
that limited sample compute validity coefficients. The results are given in 
Table 13.12 for the instance in which a rather high, but not unknown, selec- 
tion of the top 13 per cent occurred. It can be seen that where there were 
substantial correlations in the unrestricted sample the correlations in the 
selected group often shrank close to zero and, in one instance, to a trivial 
negative r. On the whole, those tests that correlated highest with the 
stanine lost most in validity correlation because of selection on the basis of 
the stanine. 
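The shrinkage produced by selecting a high segment can also be seen in a small simulation. The sketch below is not a reanalysis of the AAF data: the population correlation of .60, the sample size, the seed, and the function names are all assumptions chosen for illustration. It draws normally distributed pairs, keeps the top 13 per cent on the selection variable, and compares the two coefficients.

```python
import math
import random


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, rho=0.60, top_fraction=0.13, seed=1):
    """Return (r in the total group, r in the top segment selected on x)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        pairs.append((x, y))
    pairs.sort(key=lambda pair: pair[0], reverse=True)  # highest x first
    kept = pairs[: int(n * top_fraction)]
    r_total = pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
    r_selected = pearson_r([p[0] for p in kept], [p[1] for p in kept])
    return r_total, r_selected


r_total, r_selected = simulate()
print(f"total group r = {r_total:.2f}, selected group r = {r_selected:.2f}")
```

Under these assumptions the coefficient in the selected segment comes out well below the coefficient in the total group, in the same direction as the drop shown in Table 13.12.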

Evaluation of the Correction Formulas for Restriction. It should be repeated 
that the problem of restriction is important, and that if one wishes to avoid 
wrong conclusions, when a substantial amount of selection has been made, 
one should apply correction procedures. Had we taken the second (restricted) 
set of coefficients in Table 13.12 seriously, without other knowledge to the 
contrary, we should probably have concluded that formerly valid tests, and 
even the stanine, had lost their former validities that were known early in the 
war when selection was a cause of little restriction. 

1 Thorndike, R. L. (ed.). Research Problems and Techniques. AAF Aviation Psychology 
Research Program Reports, No. 3. Washington, D.C.: GPO, 1947. 


TABLE 13.12. VALIDITY COEFFICIENTS FOR SELECTIVE TESTS AND A COMPOSITE SCORE 
FOR THE SELECTION OF PILOT STUDENTS WITH AND WITHOUT RESTRICTION OF RANGE 

                          | Correlation in the | Correlation in the selected 
Variable                  | total group        | highest 13 per cent 
                          | (N = 1,036)        | (N = 136) 

Pilot stanine             | .64                | .18 
Mechanical principles     | .44                | .03 
General information       | .46                | .20 
Complex coordination      | .40                | −.08 
Instrument comprehension  | .45                | .27 
Arithmetic reasoning      | .27                | .18 
Finger dexterity          | .18                | .00 


It should be remembered that the formulas rest on the assumption of nor- 
mal distributions of the population on the variables used, and the Pearson 
product-moment r is presupposed. The use of the biserial r or tetrachoric r as 
an estimate of it raises considerable question when selection is severe. 
Experience tends to show, however, that when the biserial r is used as the 
validation coefficient, the formulas tend to underestimate the unrestricted 
correlation. The standard errors for these corrected coefficients are unknown, 
but it is probable that they are much larger than those for Pearson r's of 
comparable size. 

Correlations in Heterogeneous Samples. Studies of validity of tests and 
examinations have frequently been faulty from a number of standpoints. 
The use of school marks as criteria of success in training is in itself a question- 
able procedure, school marks being derived as they generally are on the basis 
of measurements of questionable reliability and validity and contaminated 
with irrelevant factors. This situation alone stacks the cards against high 
validity coefficients for predictive indices. 

There is another factor working against fair tests of validity that we shall 
face particularly here, a factor also dependent upon the unwarranted faith in 
school marks as dependable measures of scholarship. This factor is the indis- 
criminate pooling of marks from different subjects and from different instruc- 
tors and treating them as if they were of the same kind of coin. Any cursory 
inspection of grade distributions in a single institution of learning will show 
that marks are not by any means of constant value when obtained from 
different sources. The reader is referred to the situation in Fig. 14.2 where 
students in an English course making the same score in a common achieve- 
ment examination were assigned different marks in different sections and by 
different instructors, probably within the same section. If it is assumed that 
the comprehensive examination was a valid measure of the students’ relative 
degree of mastery of the objectives of the course, it can be seen how much 
other factors must have entered into the determination of the final mark in 
the course. 

Reference to Fig. 14.2 will show that there is quite a range of scores, from 
about 85 to 125, within which students were assigned marks all the way from 
F to B. Only as between marks of F and A is there rather complete lack of 
overlapping. Striking as this situation is, it is probably rather representative 
of how much lack of correlation there is between school marks and genuine 
achievement. Much of this is due to the fluctuation of marking ideas and 
ideals from instructor to instructor. This variation from set to set of marks 
when they are collectively correlated with other measures is bound to alter 
the apparent amount of correlation. 

As an example, in six sections of freshman English, within sections the 
correlation between quiz averages for the semester and a final comprehensive 
examination ranged from .63 to .92, with an over-all correlation within 
sections, when intersection differences had been eliminated, of .83. Yet when the 
six sections were combined, with intersectional differences left in, the correlation 
was reduced to .71. It was interesting to find that between sections the 
correlation was −.17, which means that there was a very slight tendency for 
sections with average lower achievement to be given a higher average quiz mark! 
This fact accounts for the reduction in correlation from .83 to .71 when 
sections were combined. 
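How a within-sections correlation, a between-sections correlation, and a combined correlation can differ is easy to see with made-up numbers. The scores below are hypothetical and deliberately exaggerated (they are not the English-section data): within each of three sections the two measures rise together, while the section means run in opposite directions. The function and variable names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def center(values):
    """Express scores as deviations from their own section mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical sections: x = examination score, y = quiz average.
sections = {
    "A": ([1, 2, 3, 4], [6, 7, 9, 10]),
    "B": ([3, 4, 5, 6], [4, 6, 7, 8]),
    "C": ([5, 6, 7, 8], [3, 4, 6, 7]),
}

all_x, all_y, within_x, within_y, mean_x, mean_y = [], [], [], [], [], []
for xs, ys in sections.values():
    all_x += xs
    all_y += ys
    within_x += center(xs)  # section differences removed
    within_y += center(ys)
    mean_x.append(sum(xs) / len(xs))
    mean_y.append(sum(ys) / len(ys))

r_within = pearson_r(within_x, within_y)   # pooled within-sections correlation
r_between = pearson_r(mean_x, mean_y)      # correlation of section means
r_combined = pearson_r(all_x, all_y)       # sections thrown together

print(f"within = {r_within:.2f}, between = {r_between:.2f}, combined = {r_combined:.2f}")
```

With these numbers the within-sections coefficient is high and positive, the between-sections coefficient is strongly negative, and the combined coefficient falls near zero.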

Figure 13.5 pictures the kind of situation just described, in somewhat 
exaggerated form, in diagram II. Diagram II is best understood by con- 
trasting it with diagram I. In the latter we have a homogeneous combina- 
tion of four subsamples drawn from the same population. The correlation 
between X and Y within each subsample is indicated by a smaller ellipse. 
All the ellipses are of about the same shape, indicating about the same degree 
of correlation of X and Y. The x marks indicate the means of Y and X 
within each subsample. If we combine the four samples, we obtain a dis- 
tribution described approximately by the large dotted ellipse. Note that the 
proportions of the large ellipse are about the same as for each small ellipse, 
indicating the same level of correlation within the composite distribution as 
within each subsample. Note, also, that the distribution of the four means 
forms roughly an ellipse of similar proportions. If the correlation between 
means of Y and means of X differs from that within subsamples, the 
correlation of X and Y in the composite sample will differ from that within 
subsamples. 

1 Further discussion of “within” versus “between” correlations when groups are 
combined will be found in E. F. Lindquist’s Statistical Analysis in Educational 
Research. Pp. 219f. 

In diagram II of Fig. 13.5 we have a very different situation. While within 
each subsample the correlation between K and L is the same, the subsamples 
did not arise from the same population so far as means are concerned. An 
ellipse drawn to enclose the x’s would slant in the direction to assure a 
negative correlation between means of K and means of L. The effect of this can 
be seen in the dotted line enclosing all subsamples. Its form suggests 
approximately zero correlation. Such situations are not uncommon. In 
general practice, if it is doubtful whether subsamples arose by random 
sampling from the same population, it would be best to compute correlations 
within subsamples separately or to apply equivalent procedures which we 
shall not take the space to describe here.¹ The hypothesis of homogeneity of 
samples can be tested by means of t tests or F tests as described in Chap. 10. 

[FIG. 13.5. Illustration of correlation in homogeneous and heterogeneous groups of 
subsamples. Diagram I: scores in Y against scores in X. Diagram II: scores in L 
against scores in K. Diagram III: scores in T against scores in S.] 

1 See Lindquist, op. cit. 

The Correlation of Averages. It was stated in an earlier chapter in 
connection with tests of significance of differences between statistics (Chap. 9) 
that the correlation between averages of samples is equal to the correlation 
between individual pairs of measurements. This statement assumes random 
samples from a homogeneous population. Diagram I in Fig. 13.5 illustrates 
this kind of situation and shows how an r obtained within one sample can be 
used as an estimate of a correlation between means. Diagram II shows how 
a correlation coefficient obtained within a single sample might be very mis- 
leading as to the amount of correlation between means. This shows an 
instance in which the correlation between means is decidedly lower, if not 
reversed in sign, than that within samples. 

The correlation between means could also be higher than that within 
samples, as diagram III shows. An example of this would be the correlation 
between IQ and salary. Correlating individuals, we should find some 
positive correlation, but because of great variations in salary at any single IQ 
value, the correlation might not be very high. If we divided men into sets 
according to vocation and correlated average IQ with average salary, the 
coefficient would probably be very high. This is because people of different IQ 
levels gravitate to certain occupations, and occupations as such have 
established characteristic salary scales. Other factors that make for individual 
differences in salary within occupations are thus minimized in importance. 
The sampling is biased the moment we divide groups along occupational lines. 

Averaging Coefficients of Correlation. One solution to the problem of 
correlations in some heterogeneous samples is to estimate the correlation 
between X and Y within each subsample and then average the coefficients in 
order to obtain a single estimate of the population correlation. This would 
presumably describe the relation between X and Y throughout the composite 
sample, free from whatever sampling biases there may have been in 
segregating the subsamples. Before averaging coefficients, however, we must 
make the assumption that the several r's did arise by random sampling from 
the same population, that is, the same with respect to the degree of correlation. It 
should go without saying, also, that we have correlated the same variables in 
all samples. The test of homogeneity of the r's themselves would be based 
upon their standard errors. 

There are several procedures sometimes used in averaging r's. Coefficients 
of correlation are not values on a scale of equal metric units; they are index 
numbers. Differences between large r's are actually much greater than those 
between small r's. If the few sample r's to be averaged, however, are of about 
the same value and if they are not too large, a simple arithmetic mean will 
suffice. If the r's differ considerably in size and if they are large, some writers 
urge the procedure that involves Fisher's z coefficients. This procedure 
is illustrated in Table 13.13. It consists of transforming each r into a 
corresponding z (Table H may be used for this purpose), finding the 
arithmetic mean of the z's, and, finally, transforming the mean z back to the 
corresponding r. 




TABLE 13.13. DEMONSTRATION OF AVERAGING COEFFICIENTS OF CORRELATION WHEN 
r's DIFFER IN RANGE AND IN SIZE 

     |       Sample A       |       Sample B       |       Sample C       |       Sample D 
     | Mean of r | z method | Mean of r | z method | Mean of r | z method | Mean of r | z method 

     |    .45    |   .48    |           |          |           |   .37    |    .65    |   .78 
     |    .50    |   .55    |           |          |           |   .62    |    .85    |  1.26 
     |    .42    |   .45    |           |          |           |   .83    |    .98    |  2.30 
     |    .38    |   .40    |           |          |           |   .55    |    .80    |  1.10 
     |    .55    |   .62    |           |          |           |   .66    |    .88    |  1.38 
Σ    |   2.30    |  2.50    |           |          |           |  3.03    |   4.16    |  6.82 
Mean |           |          |           |          |           |          |           |  1.364 


The results of Table 13.13 show differences to be expected in the use of an 
arithmetic mean of r's and of corresponding z's. Samples A and B have the 
same range of r's, those in B being merely .30 greater than those in A. In 
sample A, agreement is perfect in the results from the two methods. In 
sample B, the mean r by the z method is .01 higher (.77 as compared with .76). 
In samples C and D there is much more spread in the r's averaged. For the 
r's of moderate size, sample C, the z method gives a result only .01 greater 
than the simple mean of r’s. In the high coefficients, however, the difference 
is about .05. 
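The two ways of averaging can be compared directly in Python, since Fisher's z is the inverse hyperbolic tangent of r. The sketch below uses the r's listed for samples A and D in Table 13.13; the function names are illustrative, and exact z values are used in place of the two-decimal values read from Table H.

```python
import math


def mean_r_simple(rs):
    """Plain arithmetic mean of the coefficients."""
    return sum(rs) / len(rs)


def mean_r_via_z(rs):
    """Transform each r to Fisher's z, average the z's, transform back to r."""
    zs = [math.atanh(r) for r in rs]
    return math.tanh(sum(zs) / len(zs))


sample_a = [0.45, 0.50, 0.42, 0.38, 0.55]
sample_d = [0.65, 0.85, 0.98, 0.80, 0.88]

for name, rs in (("A", sample_a), ("D", sample_d)):
    print(name, round(mean_r_simple(rs), 2), round(mean_r_via_z(rs), 2))
```

For sample A the two results agree to two decimals; for the high and widely spread coefficients of sample D the z method gives the larger value.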

There is serious question whether r's differing as much as these would 
satisfy the belief that they came from the same population by random 
sampling and hence would be candidates for averaging. When a few r's do satisfy 
this belief, the chances are that any discrepancy between a simple mean of r's 
and an average obtained by the z method would be small as compared with 
the standard error of r. If the r's did come from the same population, a mean 
of several would be a much more reliable estimate of population correlation. 
With the requirements satisfied, we could add degrees of freedom from the 
different subsamples to represent the degrees of freedom of the mean r and 
interpret its reliability and significance accordingly. 

Weighting Coefficients in Averaging. One more requirement should be 
mentioned, particularly if the last operation, combining degrees of freedom, 
is to be carried out. That is to weight the obtained r’s in averaging them. 
The weight for each sample is its number of degrees of freedom (N − 2). 
In using the z method, the weights are applied to the z's. The weight to be 
applied to a z is its corresponding N − 3. A discussion of weighted averages 
was given in Chap. 4. 
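The weighting rule can be sketched as follows. The sample sizes in the example are hypothetical, and the function names are illustrative; the weights themselves (N − 2 for r's, N − 3 for z's) are those stated in the text.

```python
import math


def weighted_mean_r(rs, ns):
    """Mean of r's, each weighted by its degrees of freedom, N - 2."""
    weights = [n - 2 for n in ns]
    return sum(w * r for w, r in zip(weights, rs)) / sum(weights)


def weighted_mean_r_via_z(rs, ns):
    """Mean of Fisher z's, each weighted by N - 3, transformed back to r."""
    weights = [n - 3 for n in ns]
    zs = [math.atanh(r) for r in rs]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z)


# Hypothetical subsamples: a small one with r = .40 and a larger one with r = .60.
rs, ns = [0.40, 0.60], [12, 22]
print(round(weighted_mean_r(rs, ns), 3))
print(round(weighted_mean_r_via_z(rs, ns), 3))
```

In both versions the larger subsample pulls the average toward its own coefficient.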

The Correlation of Parts with Wholes. We frequently want to correlate a 
part measurement, such as a part of a test battery, or a test item, with the 
whole of which it is a part. Since the variance of the total is in part made up 
of the variance of the component, that fact alone introduces some degree of 
positive correlation. The greater the relative contribution to the total 
variance by the component, the more important is this “spurious” factor. 
It is possible in a particular instance that the part is totally uncorrelated with 
the remaining parts and yet will be correlated with the total. If it is nega- 
tively correlated with the remaining parts, it will be less negatively correlated 
with the total. 

If each part contributes statistically about the same amount of variance 
to the total or if the part is one of a great many, so that its proportion of con- 
tribution is relatively small, we can compare correlations between parts and 
total with some confidence that they are compared on a very similar basis. 
But if these conditions do not obtain, we should do better to correlate each 
part with a composite of all other parts. When such a composite is unknown 
or is hard to obtain, we can still estimate the correlation by means of the 
formula 


$$r_{pq} = \frac{r_{pt}\sigma_t - \sigma_p}{\sqrt{\sigma_t^2 + \sigma_p^2 - 2r_{pt}\sigma_t\sigma_p}} \tag{13.32}$$ 
(Correlation of part with a remainder, knowing the correlation of part with total) 

where p = part score 
t = total score 
q = t − p, in other words, the total with the part excluded 
In the correlation of test items each with the total score of the test of which 
they are a part, particularly, it is important to know about how much a part 
would correlate with the total when there is really no relationship at all. We 
can estimate this, but only under the condition that each part has the same 
variance and there is zero intercorrelation among all parts. Under these 
special conditions the average amount of correlation of a part with the total is 
given by the equation 


$$r_{pt} = \sqrt{\frac{1}{n}} = \frac{1}{\sqrt{n}} \tag{13.33}$$ 
(Average correlation of a number of parts, of equal variance and zero 
intercorrelation, with their total) 

in which n = number of parts.¹ 

If we should want to know the correlation of a part with a whole of which 
it is a part and we already know the correlation of the part with the remainder 
of the whole, the estimate is made by the equation 


$$r_{pt} = \frac{\sigma_p + r_{pq}\sigma_q}{\sqrt{\sigma_p^2 + \sigma_q^2 + 2r_{pq}\sigma_p\sigma_q}} \tag{13.34}$$ 
(Correlation of part with whole, knowing the correlation between part and the 
remainder) 

in which the symbols have the same meaning as in formula (13.32). The 
utility of this formula is probably rather limited. It is given primarily to 
show what happens when two parts that correlate zero are combined. If 
$r_{pq}$ is .0 in formula (13.34), the numerator reduces to $\sigma_p$. The denominator is 
actually the standard deviation of the composite (p + q). The deduction is 
that if two parts correlate zero, when combined, the correlation of the part 
with the total will be equal to the ratio of the standard deviation of the part 
to that of the total. 

1 An adaptation has been made of formula (13.32) to the correction of item-total 
correlations for spurious overlap. See Guilford, J. P. The correlation of an item with a 
composite of the remaining items of a test. Educ. psychol. Measmt., 1953, 13, 87-93. 
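The three part-whole relations can be checked against one another in Python. The sketch assumes the standard forms of formulas (13.32) to (13.34), in which the total is the simple sum of part and remainder; the function names and the numerical values are illustrative only.

```python
import math


def part_remainder_r(r_pt, sd_p, sd_t):
    """Correlation of a part with the remainder, from its correlation with the total."""
    return (r_pt * sd_t - sd_p) / math.sqrt(sd_t ** 2 + sd_p ** 2 - 2 * r_pt * sd_t * sd_p)


def part_total_r(r_pq, sd_p, sd_q):
    """Correlation of a part with the total, from its correlation with the remainder."""
    sd_t = math.sqrt(sd_p ** 2 + sd_q ** 2 + 2 * r_pq * sd_p * sd_q)
    return (sd_p + r_pq * sd_q) / sd_t


def chance_part_total_r(n):
    """Part-total correlation for n uncorrelated parts of equal variance."""
    return math.sqrt(1 / n)


# Hypothetical part (SD 2) and remainder (SD 5) correlating .30.
r_pt = part_total_r(0.30, sd_p=2.0, sd_q=5.0)
sd_t = math.sqrt(2.0 ** 2 + 5.0 ** 2 + 2 * 0.30 * 2.0 * 5.0)
print(round(r_pt, 3))                               # inflated part-total correlation
print(round(part_remainder_r(r_pt, 2.0, sd_t), 3))  # recovers the .30

# Four uncorrelated parts of equal variance: each correlates .50 with the total.
print(chance_part_total_r(4))
```

Going from the part-remainder correlation to the part-total correlation and back again returns the starting value, which is a convenient check that the two formulas are consistent.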

Index Correlation. This is usually called spurious index correlation for the 
reason that when indices such as IQ, EQ (educational quotient), or AQ 
(achievement quotient) are correlated with each other, r is markedly 
influenced by the fact that these ratios have in common such factors as 
chronological age and mental age. IQ's from two different tests are derived from 
the MA’s obtained from the two tests each divided by the same CA. If there 
is a range of CA in the group correlated, this fact in itself introduces some 
positive correlation. 

Table 13.14 will show by means of a purely fictitious and overdrawn picture 
how this phenomenon works. For eight children who differ in chronological 
age from five to nine inclusive, mental-age ratings on two different tests are 
given. These are obviously selected children, since their mental-age values 
hover at seven and eight in a haphazard manner. Note, however, how the 
IQ's spread, from 160 through 78. The spread in IQ’s is almost entirely due 
to the spread in chronological ages. Since each child has the same 
chronological age for both IQ's, that same denominator of the ratio of his MA to CA 
assures that his IQ’s will be about the same. Some IQ's go up together in the 
two tests for children of low CA and others go down together, for children 
with higher CA. The correlation computed between IQ's is .92. The same 
sort of phenomenon goes on in the actual situation to a lesser extent when 
there is an appreciable range of chronological age. 

TABLE 13.14. DEMONSTRATION OF HOW INDEX NUMBERS MAY ACQUIRE A HIGH 
DEGREE OF CORRELATION BECAUSE OF A COMMON DENOMINATOR: 
AN EXTREME CASE 

Child | Chronological age | Mental age I | Mental age II | IQ I | IQ II 
A     | 5.0               | 7            | 8             | 140  | 160 
B     | 5.5               | 8            | 8             | 145  | 145 
C     | 6.0               | 7            | 7             | 117  | 117 
D     | 6.5               | 8            | 7             | 123  | 108 
E     | 7.5               | 8            | 8             | 106  | 106 
F     | 8.0               | 7            | 8             | 88   | 100 
G     | 8.5               | 8            | 7             | 94   | 82 
H     | 9.0               | 7            | 7             | 78   | 78 

Correlation between mental ages I and II = .00 
Correlation between IQ’s I and II = .92 
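The two coefficients reported for Table 13.14 can be recomputed from the tabled values. In the sketch below the chronological ages and IQ's are taken from the table, while the mental ages are recovered from them as IQ × CA / 100, rounded to the nearest whole year; that recovery step and all names in the code are assumptions made for illustration.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Children A through H, as listed in Table 13.14.
ca = [5.0, 5.5, 6.0, 6.5, 7.5, 8.0, 8.5, 9.0]
iq_1 = [140, 145, 117, 123, 106, 88, 94, 78]
iq_2 = [160, 145, 117, 108, 106, 100, 82, 78]

# Assumption: mental ages recovered from IQ = 100 * MA / CA, to the nearest year.
ma_1 = [round(iq * age / 100) for iq, age in zip(iq_1, ca)]
ma_2 = [round(iq * age / 100) for iq, age in zip(iq_2, ca)]

print("r between mental ages:", round(pearson_r(ma_1, ma_2), 2))
print("r between IQ's:", round(pearson_r(iq_1, iq_2), 2))
```

The mental ages are uncorrelated, yet the quotients formed by dividing them by a common chronological age correlate highly, which is the point of the demonstration.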

In the author's opinion, the term spurious is not to be confined to this type 
of situation in particular; for in a sense, all correlations are spurious to the 
extent that they are influenced by the conditions under which they were 
obtained. If one remembers what IQ’s are and interprets correlations 
between them accordingly, no particular falsification of the facts is in 
question. The important thing is that one should correlate variables in the full 
knowledge of how the measurements were obtained, if possible, and should 
report to his readers the facts needed for wise interpretation, whether it be 
variability of the correlated group or range of CA’s involved when IQ’s have 
been correlated. 

The real difficulty comes when investigator or reader takes IQ's to be some 
real, absolute properties of individuals, on the one hand, and when someone 
not oblivious to the common CA factor plays it up as a fatal source of "error," 
on the other hand. Both should remember the relative nature of all correla- 
tion coefficients. The important thing is that the wary investigator should 
not attribute his results to some supposed real nature of psychological or 
educational phenomena when some property of statistical treatment is really 
responsible. Nor will the sophisticated critic fail to grant the utility of cer- 
tain procedures shown to be fruitful under the circumstances of operation 
even when some “spurious” element has entered the picture. Errors, too, 
are relative matters. What is an error from the point of view of one frame 
of reference may be the truth when the frame of reference is changed. 

Correction in r for Errors of Grouping. If, in computing a Pearson r by 
means of grouping data in class intervals, a small number of classes either 
way has been used, the estimate of correlation is lowered to some degree. In 
the limiting case, of two classes each way, the computed r is about two-thirds 
of the r had there been no grouping. When the number of intervals is 10 
both ways, r is about 3 per cent underestimated. For any number of classes 
in X or in Y, we can correct for the error of grouping by dividing r by a 
constant corresponding to that number of classes. 

The correction is necessary because errors of grouping yield overestimates 
of the standard deviations, as was pointed out in Chap. 5. If Sheppard's 
correction has been applied to both standard deviations, no further correction 
is necessary in the coefficient of correlation. 

Table 13.15 supplies the list of constants given by Peters and Van Voorhis 
to be used in making corrections in r.¹ Correction is made for the number of 
categories or intervals in Y as well as in X. The correction factors are used 
in the following manner. Suppose that we have an obtained r of .61 in a 
problem with eight intervals in X and nine in Y. The correction factors for 
these numbers of intervals are .977 and .982, respectively. The correction 
is made by dividing the obtained r by the product of the two correction 
factors. In terms of a formula, 

$$r_c = \frac{r}{c_x c_y} \tag{13.35}$$ 
(Coefficient of correlation corrected for coarse grouping) 

in which $c_x$ and $c_y$ are the correction factors for variables X and Y, 
respectively, based upon the number of class intervals in each. Applied to the 
correlation of .61 with eight and nine categories in X and Y, 

$$r_c = \frac{.61}{(.977)(.982)} = .636 \text{ (or .64)}$$ 

When there are the same number of intervals in both X and Y, the correction 
factor is the same for both, and the factor squared would be called for in the 
denominator of formula (13.35). The factors squared are given for this 
purpose in Table 13.15. 

1 Peters and Van Voorhis, op. cit. P. 398. 
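The correction for coarse grouping is a single division, sketched here in Python. The function name is illustrative; the two factors are the ones quoted in the text for eight and nine intervals.

```python
def correct_for_grouping(r, c_x, c_y):
    """Divide the obtained r by the product of the two grouping factors."""
    return r / (c_x * c_y)


# Obtained r of .61 with eight intervals in X (.977) and nine in Y (.982).
print(round(correct_for_grouping(0.61, 0.977, 0.982), 3))
```

Since .977 × .982 is about .959, the quotient evaluates to about .636.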


TABLE 13.15. CORRECTION FACTORS FOR ERRORS OF GROUPING IN THE COMPUTATION 
OF PEARSON'S r WHEN DISTRIBUTIONS ARE NORMAL AND MIDPOINTS OF 
INTERVALS STAND FOR CASES IN THE INTERVALS 

Number of intervals 
Correction factor 
Squared correction factor 

When the number of intervals in either X or Y is less than 10 it is good 
practice to apply this correction procedure, certainly when the number of 
intervals is eight or below. There is most to be gained in accuracy of esti- 
mate of r when the obtained r is large; little to be gained if r is small, particu- 
larly if the sample is small. 

It should be remembered that the correction factors given in Table 13.15 
are designed especially for the situation in which the midpoint of an interval 
is the index number for cases in that interval, the intervals are equal in size, 
and the distributions are normal. For other, less common situations, see the 
reference below.¹ 

1 Peters and Van Voorhis, op. cit. P. 398. 

Correction of Phi for Coarse Grouping. Since the phi coefficient is a 
product-moment estimate of correlation, the question arises as to whether it is ever 
subject to this kind of correction. This question should arise only when one 
or both variables are actually continuously measurable and we want a more 
realistic estimate of correlation that describes the relationship that exists 
when the variable is used in graded form. As to number of "intervals," we 
have two each way when φ is computed. The index number for each interval 
is not the midpoint, however, but is the mean of the cases in the interval. 

If we can assume that the actual distributions of both X and Y in the 
population are continuous and normal, a Pearson r may be estimated from φ under 
limited conditions. Those conditions are (1) φ is not greater than .4 and (2) 
p and p' are within the range .3 to .7. The formula is 

$$r_\phi = \phi\left(\frac{\sqrt{pq}}{y}\right)\left(\frac{\sqrt{p'q'}}{y'}\right) \tag{13.36}$$ 
(Estimate of a Pearson r from φ) 


where the symbols are as defined in Table 13.6, p. 305.¹ It will be noted that 
the multiplying factors are the same as in formula (13.14%). When a point- 
biserial r is wanted rather than a Pearson r, the estimate calls for only one of 
the multiplying factors, namely that corresponding to the one continuous variable.² 
If p and p’ are .5, formula (13.36) may be applied when φ is as high as .6. 

When the specified conditions are not met, it is best to estimate the Pearson 
by computing a tetrachoric r. 
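A Python sketch of the estimate follows. It assumes that formula (13.36) multiplies φ by one factor of the form sqrt(pq)/y for each dichotomized variable, where y is the ordinate of the unit normal curve at the point of dichotomy; the function names are illustrative, and `conditions_met` simply encodes the limits stated in the text.

```python
import math
from statistics import NormalDist


def dichotomy_factor(p):
    """sqrt(pq) / y for a normal variable cut so that a proportion p lies on one side."""
    q = 1 - p
    z = NormalDist().inv_cdf(p)   # point of dichotomy on the unit normal scale
    y = NormalDist().pdf(z)       # ordinate of the normal curve at that point
    return math.sqrt(p * q) / y


def pearson_from_phi(phi, p, p_prime):
    """Estimate a Pearson r from phi, one multiplying factor per variable."""
    return phi * dichotomy_factor(p) * dichotomy_factor(p_prime)


def conditions_met(phi, p, p_prime):
    """Limits stated in the text for applying the estimate."""
    if p == 0.5 and p_prime == 0.5:
        return abs(phi) <= 0.6
    return abs(phi) <= 0.4 and 0.3 <= p <= 0.7 and 0.3 <= p_prime <= 0.7


# Hypothetical fourfold table with phi = .30 and both splits at the median.
print(conditions_met(0.30, 0.5, 0.5))
print(round(pearson_from_phi(0.30, 0.5, 0.5), 3))
```

With both variables split at the median, each factor is about 1.25, so the estimate is φ multiplied by about 1.57.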


Exercises 


1. Compute by the rank-difference method the correlation between the first 20 scores 
in any two variables in Data 8A, Interpret your result, and comment on the question of 
statistical significance of the coeflicient. 

2. Compute for Data 154 a correlation ratio for the prediction of Y from X. Find the 
standard error of the obtained eta. Compute a standard error of estimate. Apply the 
F test of linearity, with rz, taken as .629. Interpret all results. 

3. Find from the literature one or two applications of the correlation ratio. State how 
the author used eta, and give his reasons, if stated. Was a test of linearity applied? Make 
your judgment as to the effectiveness of the uses of eta in the cases cited. 

4. In the data in Table 14.6, combine the distributions of cases receiving marks of 
A, B, or C into a single composite distribution; also, in another composite distribution, 
combine those receiving marks of D and F. Compute for these data a biserial r between 
scores and marks. Find the standard error of r_b. Interpret your results.

5. Compute a tetrachoric coefficient of correlation for Data 144. Determine whether 
or not the correlation is probably significantly different from .00. If the Thurstone
diagrams, or other computing aids, are available, find another estimate of the tetrachoric r
for the same data. 

6. Cite some other fourfold tables found in this book to which the tetrachoric r applies. 
Cite some other tables to which it does not apply. Explain. 

7. Reduce to a fourfold table preparatory to computing a tetrachoric r the frequencies 
in Data 11B. Do the same for Data 8B and Data 154. 

8. Compute a phi coefficient for Data 114, using the different formulas provided in this 
chapter. Estimate a Pearson r for these same data, using the obtained phi coefficient.
Also estimate a Pearson r by computing a tetrachoric r, and compare the two estimates.


1 Guilford, J. P., and Perry, N. C. Estimation of other coefficients of correlation from 
the phi coefficient. Psychometrika, 1951, 16, 335-346. 

2 Michael, W. B., Perry, N. C., and Guilford, J. P. The estimation of a point biserial
coefficient of correlation from a phi coefficient. Brit. J. Psychol., Stat. Sec., 1952, 5,
139-150.

9. Find in this volume or in other sources some examples of data in which phi would be
the most appropriate correlation coefficient to compute. Give reasons. 

10. Find in the literature some examples of coefficients of correlation that might be 
regarded as spurious from some point of view. How did the author interpret them? 
How would you interpret them? 

11. Compute the following partial r's for Data 164: rfa. тз 781.2. Interpret 
each result. Which of these coefficients has the most psychological or practical meaning?
Which the least? Explain.

12. In Data 164, tell which partial r’s it would be most enlightening to compute. 
Explain. 


Answers 


1. pia (parts I and II) = —.11; ps = .65. From Table L, ру: is insignificant; pss is 
significant beyond the .01 level. 

2. n = 660; oy = .053; су: = 5.06; F = 1.24. 

4. т = .637; on, = .086. 

5. feos-pi = -63; r, = .59 (from Thurstone’s diagrams); т, = .091 (when ғ, = .00). 

7.

                                             15-23    0-14
   .140-.189      8      9         21-38       45      16
   .100-.139     20     14          6-20       13      39


8. ф = —.096; = —.151; п = —.150, 
11. rais = .395; raa = .466; ғыз = .241, 


CHAPTER 14 


PREDICTION OF ATTRIBUTES 


One of the most important fruits of scientific investigation and one of the 
most exacting tests of any hypothesis is the ability to make predictions. So 
important is this topic that it deserves to have considerable space devoted to 
it. Particularly is this true for the reason that statistical reasoning is basic 
to all predictions. Statistical ideas not only guide us in framing statements 
of a predictive nature but also enable us to say something definite concerning 
how trustworthy our predictions are—about how much error one should 
expect in the phenomenon predicted. The practical significance of this can- 
not be questioned. The significance even for the scientific investigator is 
too often unrecognized or forgotten. 

It is the purpose of this chapter, and the next two, to illustrate the kinds of 
predictions the statistically oriented investigator makes and how he not 
only does not blind his eyes to his failures but brings them clearly into the 
light. 

General Types of Prediction. Although in this volume we have gener- 
ally emphasized measurement, we have had to recognize from time to time 
that complete measurements cannot be made and that data are sometimes 
obtained as merely classified in categories. The latter type of data we 
recognize as enumeration data, a rudimentary form of measurement. It 
is a matter of assigning attributes to cases rather than quantitative evalu- 
ations on a linear scale, for example, identifying individuals as to sex, race, 
political party, or criminality. Although such data are not allocated to 
linear-scale positions, we can still make predictions from them and predic- 
tions of them from other information. We thus have four cases of predicting: 

1. Attributes from other attributes, as when we predict incidence of
criminality from sex, race, or religious creed

2. Attributes from quantitative measurements, as when we predict
criminality from scores on tests of ability or of behavior traits

3. Measurements from attributes, as when we predict probable test
scores from sex, socioeconomic status, or marital status

4. Measurements from other measurements, as when we predict
achievement in school from IQ-test scores

General Ways of Evaluating Accuracy of Prediction. Predictions are
obviously sound if they prove to be correct. The degree of correctness is
indicated by how often or how nearly we hit the mark. In the case of
predicting attributes, our success can be numerically indicated in terms of the
percentages of “hits.” But a more accepted way among statisticians is 
to ask how much better our predictions are than if we had not used the 
information we have—in other words, if we had not tried to predict one 
thing from the knowledge of another but merely from a knowledge of the 
predicted population itself. A more crude way of saying it would be to 
ask how much better our predictions are than guesswork. But this does 
not mean pure guesswork, as we shall see later. 

In predicting measurements, whether from attributes or from other 
measurements, we ask a similar question. But whereas in predicting
attributes for cases, we work in terms of the number of hits or misses, in
predicting measurements, we work in terms of how far on the average we have
missed the mark. We compare this average deviation between fact and 
prediction with the average of the errors we should make without using the 
knowledge we did as a basis of prediction. 

Let us see in a preliminary way what this means. We can predict that 
a student’s mark in a course will be somewhere in the range from A to F 
inclusive, and most probably it will be a mark of C, which more students 
earn than any other mark. This prediction is made without knowledge 
of the student’s scholastic-aptitude score, and its margin of error is meas- 
urable in terms of the standard deviation of the distribution of marks of 
all students. If we used knowledge of the students provided by aptitude- 
test scores, we should predict some to earn marks higher than C and some 
lower than C. The average of our deviations between prediction and fact 
will now be smaller than the standard deviation of the distribution of all 
marks. The difference between these averages of deviations tells us how 
much the knowledge of aptitude scores has improved our predictions. 


PREDICTING ATTRIBUTES FROM OTHER ATTRIBUTES 


Predictions Can Be Made in Both Directions. As our first example of 
prediction of attributes from other attributes, let us consider the data in 
Table 14.1. Here we have the numbers of persons in a “depressed” group 
who responded by saying “Yes,” “?,” and “No” to the question, “Would
you rate yourself as an impulsive individual?” and also the numbers of a 
group described as “not depressed.” The individuals in these two categories 
are the highest and lowest quarters of a sample of 1,000 students who were 
ranked in terms of a provisional scoring on a personality inventory. Table 
14.1 provides us with two prediction problems. We can attempt to predict 
the verbal response to the question, knowing whether the person is in the 
depressed or not-depressed group; or we can attempt to predict the group 
to which a person belongs, knowing what response he has made. Let us
take the prediction of verbal response first.

TABLE 14.1. DISTRIBUTION OF RESPONSES TO THE QUESTION, “WOULD YOU RATE
YOURSELF AS AN IMPULSIVE INDIVIDUAL?” AS GIVEN BY TWO EXTREME GROUPS
OF STUDENTS

                           Response
Group                Yes      ?      No     Total
Depressed             72     45     133      250
Not depressed        106     35     109      250
Both                 178     80     242      500


The Principle of Maximum Likelihood. Considering first the depressed 
group by itself, we find that the largest number of them respond with “No.” 
Taking each member of the depressed group as he came along, we should 
predict for him the response “No.” If all 250 came up for inspection, we 
should be correct 133 times out of 250, or 53.2 per cent of the time. For 
other samples from the same depressed population, we should expect a 
similar ratio of correct predictions. This illustration sets the pattern for all 
predictions of attributes from attributes. The prediction always observes 
the mode or most frequent attribute in the segment of the population chosen 
at the moment. For the not-depressed group, the mode is also at the 
response “No”; hence that is our prediction also for them, and our per- 
centage of accuracy is 43.6 per cent, not so high as before but higher than 
if we had predicted either “Yes” or “?” for this group. Such predictions
follow the principle of maximum likelihood or maximum probability. Either 
a depressed or a not-depressed person in this population is more likely to 
respond “No” than anything else, and so that is our prediction. 
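The rule just described can be written out in a few lines. The sketch below is only an illustration, not part of the original procedure; the names TABLE_14_1 and modal_prediction are hypothetical. The “?” frequency for the not-depressed group is entered as 35, the value required by the row total of 250 and the column total of 80.

```python
# Frequencies from Table 14.1 (outer key: group; inner key: response).
# The "?" entry for the not-depressed group is 35, as required by the
# row total of 250.
TABLE_14_1 = {
    "Depressed": {"Yes": 72, "?": 45, "No": 133},
    "Not depressed": {"Yes": 106, "?": 35, "No": 109},
}


def modal_prediction(freqs):
    """Return (modal category, number correct, per cent correct).

    The prediction for every member of the segment is its most frequent
    category, so the number of hits is simply the modal frequency.
    """
    category = max(freqs, key=freqs.get)
    hits = freqs[category]
    total = sum(freqs.values())
    return category, hits, round(100 * hits / total, 1)


for group, freqs in TABLE_14_1.items():
    print(group, modal_prediction(freqs))
```

For both groups the predicted response is “No,” with 53.2 and 43.6 per cent of hits, respectively, as in the text.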

The Forecasting Efficiency in Predicting Attributes. How good are these 
predictions? Since we have predicted the same response for both depressed 
and not-depressed individuals, we suspect that knowing to which group the 
person belongs helps us little, if any, to predict his response. A comparison 
of the percentages of correct predictions, however, tells us that we can be 
more sure of our prediction of “No” if the person is depressed than if he is not.
But no matter from what group the person comes, our prediction is the same, 
and so it is as if we could make no use of the knowledge of his group affiliation 
for this purpose. 

Let us compare the number of successes of prediction made with and with- 
out knowledge of group affiliation. Taking both groups combined, we should 
predict for each person at random the response “No,” and we should be cor- 
rect 242 times in 500, or 48.4 per cent. In the two groups predicted sepa- 
rately, we found successes of 133 and 109, which combined give us 242 correct 
hits, or 48.4 per cent. We have thus gained no more accuracy in predicting
responses from a knowledge of group affiliation than we could attain without
this knowledge. The forecasting efficiency in predicting response from
knowledge of group is therefore just zero. The work of calculating forecasting
efficiency may be seen more clearly if summarized as in Table 14.2.


TABLE 14.2. PREDICTIONS OF RESPONSE FROM KNOWLEDGE OF THE GROUP MEMBERSHIP

Group membership              Predicted     Number     Per cent
                              response      correct    correct
Depressed                        No           133        53.2
Not depressed                    No           109        43.6
Both                                          242        48.4
Correct without knowledge                     242        48.4
Excess with knowledge                           0         0.0


The second prediction problem here is to reverse matters and predict group
membership from knowledge of the response. All persons responding “Yes”
we should predict to be members of the not-depressed group, since 106
actually are, as compared with 72 who are not. Again the modal attribute is our
prediction. For those responding “?” the prediction is membership in the
depressed group, and so also for those responding “No.” The percentages of
correct predictions are given in Table 14.3 for each response and for all
combined. Altogether, there are 284 correct predictions, or 56.8 per cent.
Without knowledge of which response each person made to the question, but with
knowledge that half the total population are depressed and half are not, our
expected number of chance successes is 250. Our predictions with knowledge
of responses yielded an excess of 34, or a forecasting efficiency of 13.6 per cent.
We can say that our predictions with knowledge of response to the question are
13.6 per cent better than those made without this knowledge would be.


TABLE 14.3. PREDICTIONS OF GROUP MEMBERSHIP FROM KNOWLEDGE OF VERBAL
RESPONSE TO THE QUESTION

Response                      Predicted group    Number     Per cent
                                                 correct    correct
Yes                           Not depressed        106        59.6
?                             Depressed             45        56.3
No                            Depressed            133        55.0
All responses                                      284        56.8
Correct without knowledge                          250        50.0
Excess with knowledge                               34        13.6
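The tallies for the two directions of prediction can be checked with a short sketch. It is offered only as an illustration; the function and variable names are hypothetical, and the forecasting efficiency is taken, as in the text, to be the excess of hits expressed as a percentage of the hits obtainable without the knowledge. The “?” frequency for the not-depressed group is entered as 35, as the marginal totals require.

```python
# Table 14.1; the "?" entry for the not-depressed group is 35, as
# required by the marginal totals.
TABLE_14_1 = {
    "Depressed": {"Yes": 72, "?": 45, "No": 133},
    "Not depressed": {"Yes": 106, "?": 35, "No": 109},
}
RESPONSES = ("Yes", "?", "No")


def forecasting_efficiency(segments):
    """Return (hits with knowledge, hits without, excess, per cent gain).

    `segments` is a list of frequency dicts, one per known category.
    With knowledge we predict the mode within each segment; without it
    we predict the mode of the combined distribution for everyone.
    """
    categories = list(segments[0])
    combined = {c: sum(seg[c] for seg in segments) for c in categories}
    without = max(combined.values())
    with_knowledge = sum(max(seg.values()) for seg in segments)
    excess = with_knowledge - without
    return with_knowledge, without, excess, round(100 * excess / without, 1)


# Predicting response from group: one segment per group.
by_group = list(TABLE_14_1.values())

# Predicting group from response: one segment per response.
by_response = [
    {group: TABLE_14_1[group][r] for group in TABLE_14_1} for r in RESPONSES
]

print(forecasting_efficiency(by_group))      # (242, 242, 0, 0.0)
print(forecasting_efficiency(by_response))   # (284, 250, 34, 13.6)
```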

Prediction Not Equally Good in the Two Directions. It is now well apparent
that we can predict successfully group membership from knowledge of
responses in this problem, whereas we cannot predict response from
knowledge of group membership. It is not always true, as it is here, that successful
prediction is possible in one direction and entirely impossible in the other, but 
it is a quite common finding that prediction is better in one direction than in 
the other when two variables are concerned. It will often clarify thinking 
about predictive problems to keep this fact in mind. It is sometimes assumed 
by the uninformed that if A can be predicted from B, B can, in turn, be pre- 
dicted from A. Such an assumption is likely to lead the unwary investigator 
into logical and practical difficulties when it is seriously wanting in applica- 
bility. This is a more serious matter in dealing with attributes than in deal- 
ing with measurements, for in the latter case the predictability of one meas- 
ured trait A from a measured trait B is usually not very divergent from the 
predictability of B from A. 

The Sampling Procedure in Prediction of Attributes. The evaluations of 
predictions already given are meaningful and useful. There is still the
problem of how significant the decisions based upon the sample may be for the
population. This calls for application of sampling statistics. For this pur- 
pose we can adapt the use of chi-square, ¢, and ¢ tests, all of which have been 
previously described. Their application here contains some new features 
that need to be explained. 

The Cell-square-contingency Method. We can compute a chi square for 
the entire contingency table involved in the prediction problem, and that 
would be meaningful as an over-all index of significance of predictive value 
somewhere among the categories. As we saw in the previous examination of 
predictions, however, some predictions are apparently better than others 
within the same table. By breaking chi square down into components or, 
rather, by examining the contributions to chi square from the different 
categories, we obtain a more analytical picture of each one’s significant con- 
tribution to prediction. Table 14.4 shows the customary steps in the solution 
of chi square. The last segment of the table, in which are given the cell- 
square contingencies, is particularly to be noted. 

The chi square for the entire table is equal to 10.12, which, with 2 degrees 
of freedom, is significant just beyond the 1 per cent level. We next examine 
each column of the table, for the sum of the cell-square contingencies for that 
column (the column-square contingency) indicates the degree of significance 
to be attached to the category it represents. For the response “Yes,” the 
sum is 6.49. This may be regarded as a chi square for a two-cell table and 
tests the hypothesis that the depressed and the not-depressed groups should 
have responded “Yes” in equal frequencies to the question. With 1 degree 
of freedom, the departure from the hypothesis is significant almost at the 
1 per cent level of confidence. The square root of chi square with 1 degree of
freedom is equal to z; hence z for this response is 2.55. For the other responses,
“?” and “No,” the values are 1.12 and 1.54, both insignificant. Thus, we
have a decision as to the sampling stability of the gains in accuracy of
prediction as given in percentage terms in Table 14.3. Those percentages are 59.6,
56.3, and 55.0 for the three responses, respectively. Only the first seems 
significant. 


TABLE 14.4. DEMONSTRATION OF THE CELL-SQUARE-CONTINGENCY METHOD OF
TESTING CONTRIBUTIONS TO PREDICTION

                  Expected frequencies      Discrepancies       Squared discrepancies
                          fe                   fo - fe               (fo - fe)²
Group              Yes     ?     No       Yes     ?     No       Yes     ?     No
Depressed           89    40    121       -17    +5    +12       289    25    144
Not depressed       89    40    121       +17    -5    -12       289    25    144

                  Cell-square contingencies, (fo - fe)²/fe
Group              Yes       ?        No      Total
Depressed         3.247    0.625    1.190     5.062
Not depressed     3.247    0.625    1.190     5.062
Sum               6.494    1.250    2.380    10.124
Square root       2.55     1.12     1.54

C = .14


As for the prediction of response from knowledge of group membership, the 
answer lies in the sums of the rows of cell-square contingencies in Table 14.4. 
These sums are the same: 5.06. With 2 degrees of freedom, they fail to be 
significant at the 5 per cent level. This outcome agrees with the decision
based upon Table 14.2, where it was found that there were no excess correct
predictions attributable to knowledge of group membership, depressed versus 
not-depressed. More accurately interpreted, the row sums indicate that the 
distribution of responses of 250 depressed individuals does not differ signifi- 
cantly from that of the 500 depressed and not-depressed combined. The 
same may be said for the not-depressed group. When both are considered 
together, however, their mutual departure from a common, hypothetical 
distribution (that of the 500 combined) is sufficient to yield a chi square of 
10.12, which is significant. The corresponding coefficient of contingency (C) 
equals .14, which is another index of over-all predictive value. Because the
chi square from which C was derived is significant at the 1 per cent level, so is
C significantly different from zero correlation. 
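The arithmetic of the cell-square-contingency method can be sketched as follows. This is only an illustrative sketch; the names are hypothetical, the observed frequencies are those of Table 14.1 (with 35 for the “?” response of the not-depressed group, as the marginal totals require), and C is computed from the usual relation C = √[χ²/(N + χ²)].

```python
import math

# Observed frequencies from Table 14.1.
# Rows: depressed, not depressed. Columns: "Yes", "?", "No".
OBSERVED = [[72, 45, 133], [106, 35, 109]]


def cell_square_contingencies(observed):
    """Return the matrix of (fo - fe)**2 / fe, one value per cell."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    cells = []
    for i, row in enumerate(observed):
        cells.append([])
        for j, fo in enumerate(row):
            fe = row_totals[i] * col_totals[j] / n   # expected frequency
            cells[i].append((fo - fe) ** 2 / fe)
    return cells


cells = cell_square_contingencies(OBSERVED)
column_sums = [sum(col) for col in zip(*cells)]   # one per response
row_sums = [sum(row) for row in cells]            # one per group
chi_square = sum(row_sums)
n_cases = sum(sum(row) for row in OBSERVED)
contingency_c = math.sqrt(chi_square / (n_cases + chi_square))

print([round(s, 3) for s in column_sums])             # column-square contingencies
print([round(math.sqrt(s), 2) for s in column_sums])  # their square roots
print([round(s, 3) for s in row_sums])
print(round(chi_square, 3), round(contingency_c, 2))
```

The column sums bear on the prediction of group from response, and the row sums on the prediction of response from group, as discussed above.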

Response Significance as Indicated by Phi. Another approach, which 
applies in the special situation in which one of two categories is to be
predicted from knowledge of another variable in more than two categories, uses
the phi coefficient. Here we are interested only in the prediction of depressed
versus not-depressed group membership from knowledge of response to a
question. A φ coefficient would be quite suitable to indicate the correlation
of each response to the item with a two-category criterion. When there are 
more than two responses, as in the present illustration, we can validate each 
response separately, although it is, to be sure, just one item, because there is 
more than 1 degree of freedom. The validity of any one response, or its 
correlation with the criterion, does not automatically determine the validities 
of the others, though, of course, it will have some bearing upon that validity. 

The procedure is demonstrated in Table 14.5. There we have three different
2 × 2 contingency tables, one for determining the φ for each response.
When validating one response we group the others into one category. The
two categories when validating response “Yes” are responses “Yes” and
“Not yes,” and so on. The φ’s for the three responses are .142, .055, and
.095, respectively. This is another basis of comparing the effectiveness of
the three responses as discriminating between depressed and not-depressed
groups. We cannot be very sure that the differences in size of φ’s are
significant, since we do not have standard errors of the φ’s. We can test the
hypothesis of zero correlation, however, by means of the chi squares, which
are 10.08, 1.49, and 4.61, respectively. These are to be interpreted as very
significant, insignificant, and significant, for responses “Yes,” “?,” and
“No,” respectively. These chi squares come in the same rank order as the
column-square contingencies (see Table 14.4) but they are somewhat larger
than the latter.


TABLE 14.5. TESTING THE BASIS OF PREDICTION PROVIDED BY EACH CATEGORY
SEPARATELY BY MEANS OF CHI SQUARE AND PHI

Response “Yes” versus all others:    χ² = 10.08,   φ = .142
Response “?” versus all others:      χ² =  1.49,   φ = .055
Response “No” versus all others:     χ² =  4.61,   φ = .095

The differences are to be attributed to a difference in operations. The sum
of the three chi squares (10.08 + 1.49 + 4.61) obviously exceeds the sum of
the three column-square contingencies, because each column is included more
than once in the three 2 × 2 tables. There is a difference in meaning, also.
In computing the phi coefficients, we have asked, “What is the predictive
value of a selected response versus all other responses?” If we predict one
group membership in this problem from the responses “Yes,” we
automatically predict the other group membership for all other responses. We find
that it paid to group responses “?” and “No” together, but it definitely was
not so profitable to group any other pairs of responses. The function of the
“?” response was much the same as that of the “No” response. This could
have been seen in the original table (Table 14.1), in which the directions of
differences in frequencies were apparent. It was also apparent in that the
same prediction was made from the two responses. The tests of sampling
significance bear out those observations. We should obtain as much
predictive value by treating responses “?” and “No” as if they were identical as
we should by giving them individual weighting, as shown by the fact that
when we combine them the chi square (10.08) is about the same as for the
entire contingency table (10.12) when the two responses are kept separate.
This is also shown by the fact of insignificant φ for the fourfold table
featuring the “?” response in Table 14.5.
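The three fourfold tables can be formed and tested with a short sketch. It is illustrative only; the function name is hypothetical, the frequencies are those of Table 14.1 (with 35 for the “?” response of the not-depressed group), and the chi square is obtained from the familiar relation χ² = Nφ² for a 2 × 2 table. The sign of φ is dropped, as it is in the text.

```python
import math

# Table 14.1. Rows: depressed, not depressed. Columns: "Yes", "?", "No".
OBSERVED = [[72, 45, 133], [106, 35, 109]]
RESPONSES = ("Yes", "?", "No")


def phi_one_response_versus_rest(observed, col):
    """Return (phi, chi square) for response `col` against all others.

    The fourfold table has cells
        a = depressed giving the response,      b = depressed, other
        c = not depressed giving the response,  d = not depressed, other
    """
    a = observed[0][col]
    b = sum(observed[0]) - a
    c = observed[1][col]
    d = sum(observed[1]) - c
    n = a + b + c + d
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return abs(phi), n * phi ** 2


for col, response in enumerate(RESPONSES):
    phi, chi_square = phi_one_response_versus_rest(OBSERVED, col)
    print(response, round(phi, 3), round(chi_square, 2))
```

The chi squares agree with those quoted in Table 14.5; the φ’s agree to within about .001.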


PREDICTING ATTRIBUTES FROM MEASUREMENTS 


We sometimes wish to decide, on the basis of known measurements, whether
an individual should be expected to be in one category, for example, to have a
certain attribute, or whether he should be expected to be in another.
Sometimes it is a matter of making placements in different categories in order that
the individual may expect a better consequent adjustment or greater
satisfaction. Such is the case when we attempt to predict success or failure for
persons for whom we know certain test scores. This problem was solved in
principle by Guttmann.¹ Here the author will attempt to provide some
workable procedures whereby such predictions can be made and their relative
accuracy determined.

Critical Points Dividing Distributions. In Fig. 14.1, we have two
populations, differing in mean, standard deviation, and in N. We wish to find a
score on the scale of measurement that will give us the maximum accuracy 
of prediction, so that we may say of an individual whose score is higher than 
that point that he is probably a member of the upper group and of an indi- 
vidual whose score is lower than that point that he is probably in the lower 
group and, in so predicting, make the minimum number of mistakes. Let us 
call that critical point E. 

1 The Prediction of Personal Adjustment, New York: Social Science Research Council,
1941. Pp. 2717.

According to Guttmann's solution, point E comes on the scale where the
two distributions have equal ordinates, in other words, where the two curves
intersect (see Fig. 14.1). At this point, persons with scores of this value are
equally likely to be members of either group. Above this point, at any score
there is greater likelihood that the person belongs in the upper group than
that he belongs in the lower group. Below this point, at any score, there is a
greater likelihood that the person belongs in the lower group. The terms
upper and lower here apply only to relative position on the measuring scale.
The two distributions are divided according to two qualities, or attributes,
and it is possession of those attributes that we are trying to predict. As we
proceed along above point E, the probability that we are correct in our
prediction increases, since the ratio of individuals having attribute B to the number
having attribute A keeps increasing. At point B, which is the upper limit of
the range of the A group, and above B, we should have absolute certainty of
prediction so far as these particular populations are concerned. Likewise,
below point A, where the upper distribution ends, we should be absolutely
certain that no case possesses attribute B. But if the two populations are
taken as wholes, the shaded portions stand for the proportions of individuals
incorrectly predicted. The crosshatched section (of distribution A)
represents the A’s wrongly predicted to be B’s, and the stippled section (of
distribution B) represents the B’s wrongly predicted to be A's. All the B's
above point E are correctly predicted. It is on the basis of these numbers of
correctly and incorrectly predicted cases that we can judge the forecasting
efficiency, as we shall see later. First, let us see how point E can be determined.

[Figure: two overlapping distribution curves, labeled “Distribution for
Attribute A” and “Distribution for Attribute B.”]

FIG. 14.1. Distribution of two hypothetical groups possessing two distinguished attributes,
A and B, when measured on the same scale of some other variable. The aim is to predict
for each person his attribute from knowledge of his score. For those with scores above
point E we predict attribute B as being more likely; for those below E, we predict
attribute A.
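When both distributions can be assumed normal, the point of equal ordinates can also be found algebraically. The following is a minimal sketch under that assumption and is not one of the procedures described in this chapter; the function names and the two sets of N, mean, and standard deviation are hypothetical. Setting the two frequency curves equal and taking logarithms gives a quadratic equation in the score.

```python
import math


def normal_ordinate(x, n, mean, sd):
    """Height at score x of a normal frequency curve containing n cases."""
    return n / (sd * math.sqrt(2 * math.pi)) * math.exp(-0.5 * ((x - mean) / sd) ** 2)


def critical_points(n_a, mean_a, sd_a, n_b, mean_b, sd_b):
    """Scores at which the two frequency curves have equal ordinates.

    Equating the logarithms of the two ordinates gives
    a*x**2 + b*x + c = 0 with the coefficients below.
    """
    a = 1 / (2 * sd_b ** 2) - 1 / (2 * sd_a ** 2)
    b = mean_a / sd_a ** 2 - mean_b / sd_b ** 2
    c = (mean_b ** 2 / (2 * sd_b ** 2) - mean_a ** 2 / (2 * sd_a ** 2)
         + math.log(n_a / sd_a) - math.log(n_b / sd_b))
    if abs(a) < 1e-12:              # equal standard deviations: linear case
        return [] if b == 0 else [-c / b]
    disc = b * b - 4 * a * c
    if disc < 0:                    # the curves never intersect
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2 * a), (-b + root) / (2 * a)])


# Hypothetical groups: A (lower on the scale) and B (higher on the scale).
points = critical_points(300, 40.0, 10.0, 200, 60.0, 8.0)
# The root lying between the two means plays the part of point E.
print([round(x, 1) for x in points if 40.0 < x < 60.0])
```

When the standard deviations differ there are two intersections; only the one lying between the two means corresponds to point E, the other falling far out in a tail where both ordinates are negligible.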

Locating a Critical Point for an Artificial Dichotomy. The principle upon 
which the point of division is made on the continuous variable is a variation 
of the principle of maximum likelihood. For scores above the critical value, 
the probability of a case’s being in the upper category is greater than .5. For
scores below the critical value, the probability of a case's being in the upper
category is less than .5.

The location of the critical division point depends to some extent upon 
whether the dichotomy is a genuine one or whether it is an artificial one based 
upon continuous measurements. There are several methods that can be used 
to solve the problem. Some apply to either kind of dichotomy, some to one 
or the other but not to both. We shall begin with methods that apply to the 
artificial dichotomy. 


TABLE 14.6. DISTRIBUTIONS OF SCORES IN A GENERAL ENGLISH EXAMINATION MADE
BY STUDENTS RECEIVING VARIOUS MARKS IN THE COURSE

Scores        A      B      C      D      F
180-189       1
170-179       1      1
160-169       5      7      1
150-159       7     13      3
140-149       2     26     10      1
130-139       2     34     24      5      1
120-129       0     40     39      7      0
110-119       1     21     81     13      3
100-109             19     89     28      4
 90- 99              4     81     29      9
 80- 89              1     42     46      8
 70- 79                    16     29     11
 60- 69                     5     20      9
 50- 59                            6     11
 40- 49                            1      5
 30- 39                                   3
 20- 29                                   0
 10- 19                                   0
  0-  9                                   1
Sums         19    166    391    185     65


As illustrative material, let us use the data in Table 14.6. A large group of
students were given the same comprehensive final examination in freshman
English. Each instructor was at liberty to use the scores in this examination
along with other measurements as he saw fit in deriving a final mark in the
course for his students. Taking all marks collectively, for all students
receiving a mark of F, a frequency distribution of their examination scores
was set up. The same was done for students receiving marks of D, C, B, and
A. These are the five distributions listed in Table 14.6 and shown graphically
in Fig. 14.2. The amount of overlapping in ability as represented by
examination scores among these five groups is noteworthy, but it probably
represents a not unusual situation where marks are determined in the customary
manner. However that may be, let us say that students receiving F's are,
in the judgment of the teachers, failing students, and those receiving D's are
D students, etc. These five categories represent five attributes as judged by
these instructors. Let us take as our problem the task of predicting what
attribute will be assigned to students making certain scores in the examination.

FIG. 14.2. Distributions of scores received in a common final examination, for students
receiving marks of A to F.

Graphic Methods of Locating the Critical Point. When the overlapping
distributions are plotted as in Fig. 14.2, if they are fairly regular in contour, one
can immediately locate the points at which the two distributions intersect. 
Distributions for attributes F and D intersect just below a score of 60; more 
exactly, by inspection, at 57 or 58. In this approach, it would be well to 
locate the point between two whole numbers, because scores are obtained in 
whole numbers. In this case, we should predict an F for students making a 
score of 57 or lower, and a mark of D for those making a score of 58 or above 
(at least up to the critical point between D and C). Between D and C, the 
critical point, by inspection, seems to be at about 87, probably on the lower 
side. Thus, for scores 58 through 86, we should predict a mark of D. The 
next critical point seems to come between 124 and 125. The prediction of a 
C arises for scores 87 through 124. The critical point between B and A is 
almost impossible to determine but seems to lie in the region of 170 to 175. 
The small number of A's makes any solution of this kind uncertain. 
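A rough numerical counterpart of this graphic inspection is sketched below. It is not the author's procedure: it assumes that each distribution is drawn as a frequency polygon through the interval midpoints and locates, by linear interpolation, the score at which the polygon for the higher mark first rises above that for the lower mark. The names are hypothetical. The frequencies are those of Table 14.6, with blank cells entered as 0 and the F frequency for the interval 80-89 taken as 8, the value implied by the column sum of 65.

```python
# Interval midpoints, lowest interval (0-9) first, highest (180-189) last.
MIDPOINTS = [4.5 + 10 * i for i in range(19)]

# Frequencies from Table 14.6 in the same order (blank cells entered as 0;
# F for 80-89 is taken as 8, the value implied by the column sum of 65).
FREQ = {
    "F": [1, 0, 0, 3, 5, 11, 9, 11, 8, 9, 4, 3, 0, 1, 0, 0, 0, 0, 0],
    "D": [0, 0, 0, 0, 1, 6, 20, 29, 46, 29, 28, 13, 7, 5, 1, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 0, 0, 5, 16, 42, 81, 89, 81, 39, 24, 10, 3, 1, 0, 0],
    "B": [0, 0, 0, 0, 0, 0, 0, 0, 1, 4, 19, 21, 40, 34, 26, 13, 7, 1, 0],
}


def crossing(lower, upper):
    """Score at which the `upper` polygon first rises above the `lower` one.

    Returns None if the difference never changes from negative to positive
    between two adjacent midpoints.
    """
    diffs = [u - l for l, u in zip(FREQ[lower], FREQ[upper])]
    for i in range(len(diffs) - 1):
        d0, d1 = diffs[i], diffs[i + 1]
        if d0 < 0 < d1:
            step = MIDPOINTS[i + 1] - MIDPOINTS[i]
            # Linear interpolation to the point where the difference is zero.
            return MIDPOINTS[i] + step * (-d0) / (d1 - d0)
    return None


for lower, upper in (("F", "D"), ("D", "C"), ("C", "B")):
    print(lower, upper, round(crossing(lower, upper), 1))
```

For these data the F-D and C-B crossings fall near 57.6 and 124.3, in line with the values read by inspection; the D-C crossing comes out nearer 85 than 87, a reminder that unsmoothed polygons and smoothed curves need not intersect at exactly the same point.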

Should overlapping distributions be irregular in contour, particularly in
the neighborhood of the intersection point, if the data are not too limited, and
if the smoothing required is rather obvious, it would be well to resort to
smoothing before the point of intersection is sought (see Chap. 3 for a
description of smoothing procedures).

This graphic method of determining a critical dividing score point may do
for rough estimates when samples are large and contours of distribution curves
are regular. A better graphic procedure will be described next. It is not 
only rather useful in practical situations but demonstrates a more general 
conception of the prediction problem. 

TABLE 14.7. FREQUENCY DISTRIBUTIONS OF ENGLISH-EXAMINATION SCORES FOR
STUDENTS RECEIVING MARKS ABOVE CERTAIN DIVISION POINTS; ALSO PROPORTIONS
IN EACH UPPER CATEGORY AT DIFFERENT SCORE LEVELS

              (1)    (2)    (3)      (4)    (5)      (6)     (7)      (8)      (9)
Scores        f_t    f_a    p_a      f_ab   p_ab     f_abc   p_abc    f_abcd   p_abcd
180-189         1      1   (1.00)      1   (1.00)       1   (1.00)       1    (1.00)
170-179         2      1    (.50)      2   (1.00)       2   (1.00)       2    (1.00)
160-169        13      5     .38      12     .92       13    1.00       13     1.00
150-159        23      7     .30      20     .87       23    1.00       23     1.00
140-149        39      2     .05      28     .72       38     .97       39     1.00
130-139        66      2     .03      36     .545      60     .91       65      .985
120-129        86      0     .00      40     .465      79     .92       86     1.00
110-119       119      1     .01      22     .185     103     .87      116      .975
100-109       140      0     .00      19     .14      108     .77      136      .97
 90- 99       123      0     .00       4     .03       85     .69      114      .93
 80- 89        97                      1     .01       43     .44       89      .92
 70- 79        56                      0     .00       16     .29       45      .80
 60- 69        34                      0     .00        5     .15       25      .735
 50- 59        17                                       0     .00        6      .35
 40- 49         6                                       0     .00        1      .17
 30- 39         3                                                        0      .00
 20- 29         0                                                        0      .00
 10- 19         0
  0-  9         1

f_t = frequency in distribution of all students combined.

f_a = frequency in distribution of students receiving a mark of A.

p_a = proportion of students in each score interval who received a mark of A.
Proportions in parentheses are very uncertain owing to the extremely small samples from
which they are computed.

f_ab = frequency in distribution of students receiving marks of A and B.

Preparatory to the application of this method, the frequency distributions 
of Table 14.6 were combined in various ways as shown in Table 14.7. In this 
method we are interested in finding out from the data the probability that an 
individual who earned a score of a certain size will be in the upper of two 
groups. In column 1 we have the total composite distribution. In column 
2 we have the distribution of only those who received a mark of A. The 
probability of a student in any class interval on the examination receiving a
mark of A is indicated by the proportion of all those in that interval who
actually did receive a mark of A. This is an empirical probability, derived 
from the sample data. We use it as an estimate of the population proba- 
bility. Not until we go down the column of frequencies in column 2 to the 
interval 160-169 do we find frequencies of a size that would give us much 
confidence in the accuracy of the proportion derived from them. In that 
interval, 5 out of 13 received an A, or a proportion of .38. In the interval 
150-159, 7 out of 23, or 30 per cent, received an A. The other columns of the
table represent other division points as to upper and lower marking categories. 
In columns 6 and 7 we are interested in the proportions in the class intervals 
receiving a mark of C or above. 
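
The empirical proportions just described can be sketched in a few lines of Python. This is only an illustration under our own naming (the lists and the helper `proportions` are not part of the original procedure), and the frequencies are limited to five intervals of Table 14.7 as best they can be read.

```python
# Frequencies for five score intervals of Table 14.7 (as best they can be read).
intervals = ["160-169", "150-159", "140-149", "130-139", "120-129"]
f_total = [13, 23, 39, 66, 86]   # column 1: all students combined
f_a = [5, 7, 2, 2, 0]            # column 2: students receiving a mark of A
f_ab = [12, 20, 28, 36, 40]      # column 4: students receiving A or B


def proportions(upper, total):
    """Empirical probability of being in the upper category, per interval."""
    return [u / t for u, t in zip(upper, total)]


p_a = proportions(f_a, f_total)
p_ab = proportions(f_ab, f_total)

for label, pa, pab in zip(intervals, p_a, p_ab):
    print(f"{label}: p(A) = {pa:.2f}   p(A or B) = {pab:.3f}")
```

Each proportion is simply the upper-category frequency divided by the total frequency of the same interval, which is why intervals with very few cases give unstable values.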


[Figure 14.3: curves of the proportion in the higher category (vertical axis) against score in an English examination (horizontal axis).]
Fig. 14.3. Proportion of the students who are in higher letter-grade categories at each score
level in a common freshman English examination.


Figure 14.3 shows graphically the relation between these proportions and 
the various score levels. The midpoint of each interval is used to represent 
the interval. This figure demonstrates that the increase in probability of 
being in an upper of two categories on another variable (marks) is of an 
S-shaped form with different degrees of skewness. The skewness is related 
to the over-all proportion in the upper category and to the skewness of the 
total distribution. With large numbers in the upper category the skewness 
tends to be positive, and with small numbers the skewness tends to be nega- 
tive. The points are sufficiently in line that one can draw continuous curves 
through them by inspection (which has been done in Fig. 14.3), except at the 
tails of some of them where data are incomplete. 

While we are interested primarily in the score level at which the probability 
of an individual’s being in the upper category is exactly .5, it is important to 
note that these functions tell us much more than that. They tell the proba- 
bility at each score level of an individual’s being in the upper category. We 
can say that for a score of 120 there is apparently no chance of a student's 
receiving an A, there are about 31 chances in 100 of his receiving a B or above
(with no chance of an A, this amounts to the odds for receiving a B), and 
there are about 89 chances in 100 of his receiving a C or better. There is 
possibly 1 chance in 100 of his failing the course. A student with a score of
70, however, has apparently no chance of receiving an A or B, about 22
chances in 100 of receiving a C or better, about 77 chances in 100 of receiving 
a D or better, and, conversely, 23 chances in 100 of failing. 

To determine the scores corresponding to proportions of .5, by this graphic 
solution the division points appear to be: between A and B, a score point 
between 171 and 172; between B and C, a score point between 130 and 131;
between C and D, a score point between 86 and 87; and between D and F, a
score point between 57 and 58. The last two coincide with those read from
Fig. 14.2. The first is more accurately determined, though still rather uncertain.
The estimate of a division of 130.5 between marks B and C differs considerably
from the 124.5 that was read from Fig. 14.2. These comparisons
alone tell us nothing about the accuracy of either method, except that they 
agree very closely (within one unit) on two and roughly on a third, with 
intolerable disagreement on the fourth. 
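
As a rough check on such graphic readings, the point at which a proportion curve reaches .5 can be approximated by straight-line interpolation between interval midpoints. The sketch below is our own simplification, not the smoothed-curve procedure of the text; the function name `crossing_score` is hypothetical and the proportions are those of Table 14.7.

```python
def crossing_score(points, level=0.5):
    """Score at which the linearly interpolated proportion first reaches `level`.

    `points` is a list of (interval midpoint, proportion) pairs in ascending
    order of score.
    """
    for (x0, p0), (x1, p1) in zip(points, points[1:]):
        if p0 < level <= p1:
            return x0 + (level - p0) * (x1 - x0) / (p1 - p0)
    raise ValueError("proportions never reach the requested level")


# Midpoints and proportions from Table 14.7 near the two lower division points.
c_or_better = [(64.5, 0.15), (74.5, 0.29), (84.5, 0.44), (94.5, 0.69)]
d_or_better = [(54.5, 0.35), (64.5, 0.735), (74.5, 0.80)]

print(round(crossing_score(c_or_better), 1))
print(round(crossing_score(d_or_better), 1))
```

Straight-line interpolation lands near the values read from the smoothed curves but need not agree with them exactly, since the curves are drawn by inspection.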

Before leaving the two graphic methods, it should be pointed out that a 
very important difference exists between them. In the first of the two, only 
two adjacent distributions are considered in determining the critical score 
that is to separate them. In the second, we consider all cases within the one 
letter-category distribution and all others above as being in the upper group, 
and we consider all cases within the neighboring letter-category distribution 
and all others below as being in the lower group. This kind of problem comes
up only when there are several division points to be established; more often
there are only two. In the latter instance, all the distribution in X is
involved, just as it is in the second graphic method and as it is in the computational
method to follow. Not only does the second graphic method provide
more stable values to work with because of larger subsamples but it also fol- 
lows better statistical principles as expressed in the development of the 
computational method. 

A Computation of the Critical Score. It has been demonstrated recently¹
that for this type of problem (predicting membership in one of two artificial
dichotomies) a formula may be used to estimate the critical score. We
must assume for this purpose that both the distributions (in X and in Y) are 
actually continuous and normal. The formula is 

\[
X_c = M_x + \left(\frac{zy}{pq}\right)\left(\frac{\sigma_x^2}{M_p - M_q}\right) \tag{14.1}
\]
([...]tion into two categories in a correlated variable)

where M_x = mean of the entire distribution, for those in the two categories
            combined
      p = proportion of the total population in the category having the
          higher mean score on X
      q = 1 − p
      y = ordinate in the unit normal distribution at the point of division
          of the area under the normal curve with p proportion above it
      z = standard measure of the point at which the division just referred
          to occurs

¹ This method was developed by the author and W. B. Michael, and its derivation is
described elsewhere: Guilford, J. P., and Michael, W. B. The Prediction of Categories
from Measurements. Beverly Hills, Calif.: Sheridan Supply Co., 1949.

This normal distribution stands for the dichotomized variable in the same
manner as it does in connection with the computation of a biserial r. In fact,
there is a close relationship between formula (14.1) and the formula for computing
a biserial r (formula 13.7). There is an alternative formula for estimating
the critical score:
\[
X_c = M_x + \left(\frac{zy}{p}\right)\left(\frac{\sigma_x^2}{M_p - M_x}\right) \tag{14.2}
\]

The latter version of the formula is applied to the computation of X_c in the
English-examination problem, with the work shown in Table 14.8. The 
four division points by calculation are 167.8, 130.2, 86.5, and 53.1. The 
second and third are within one unit of those found by the second graphic 
method. These findings, though very limited, suggest that the second 
graphic method may be superior to the first and that neither is very satisfac- 
tory unless there are a sufficient number of points on both sides of the .5 level 
to establish the proper location of the curve in the region of that important 
level. The labor involved in computation of X_c by formula is probably no
greater than that for the graphic methods and leaves nothing to guesswork. 
The graphic method does have one advantage, that it does not require any 
assumption about the distributions on the two variables. 
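
The computation for the division between B and C can be sketched as follows. This is an illustration under stated assumptions, not the worked solution of Table 14.8: we assume the formula has the form X_c = M_x + (zy/p)·σ_x²/(M_p − M_x), we take the frequencies from Table 14.7 as best they can be read, and all function names are our own.

```python
from statistics import NormalDist

# Interval midpoints and frequencies as read from Table 14.7.
midpoints = [184.5, 174.5, 164.5, 154.5, 144.5, 134.5, 124.5, 114.5, 104.5,
             94.5, 84.5, 74.5, 64.5, 54.5, 44.5, 34.5, 24.5, 14.5, 4.5]
f_total = [1, 2, 13, 23, 39, 66, 86, 119, 140,
           123, 97, 56, 34, 17, 6, 3, 0, 0, 1]
f_ab = [1, 2, 12, 20, 28, 36, 40, 22, 19,
        4, 1, 0, 0, 0, 0, 0, 0, 0, 0]


def grouped_mean(mids, freqs):
    return sum(m * f for m, f in zip(mids, freqs)) / sum(freqs)


def grouped_variance(mids, freqs):
    mean = grouped_mean(mids, freqs)
    return sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / sum(freqs)


def critical_score_artificial(m_x, var_x, m_p, p):
    """Assumed form: X_c = M_x + (z*y/p) * var_x / (M_p - M_x)."""
    unit_normal = NormalDist()
    z = unit_normal.inv_cdf(1.0 - p)  # standard score with proportion p above it
    y = unit_normal.pdf(z)            # ordinate of the unit normal curve at z
    return m_x + (z * y / p) * var_x / (m_p - m_x)


p = sum(f_ab) / sum(f_total)          # proportion in the A-or-B category
x_c = critical_score_artificial(
    grouped_mean(midpoints, f_total),
    grouped_variance(midpoints, f_total),
    grouped_mean(midpoints, f_ab),
    p,
)
print(round(x_c, 1))
```

Under these assumptions the result falls close to the 130.2 reported above for the second division point, which lends some support to the assumed form of the formula.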

Accuracy of Predicting Artificial Categories. The evaluations of predictions
of categories when they are made from measurements can be made in a
manner similar to those previously described. Our interest may be in the 
numbers and percentages of correct predictions (or in the numbers and kinds 
of errors) and in the gain in accuracy of prediction from the new knowledge 
possessed. 

As an illustration, let us take the example of the English-examination data 
as related to course marks. To note the accuracy of prediction in two 
categories only, we may use the division between the B students and above 
and the C students or below. The indications are that the best separation 
on the score scale should be between a score of 130 and one of 131. It is not 
possible to make an exact separation of the cases given in grouped form in 
Table 14.6, since the dividing score point comes within an interval. For the 
sake of applying the test of goodness of prediction, however, let us assume 
that the 66 students are evenly distributed over the range 130-139, and that
one-tenth of them would have a score of 130. This means about seven 
students, four of whom are in the A-B mark group and three of whom are in 
the C-D-F group. With these arbitrary, but minor, adjustments, we can 
arrange the entire sample of 826 students in a 2 × 2 distribution, as in Table
14.9. 


TABLE 14.9. SUMMARY OF CORRECT AND INCORRECT PREDICTIONS OF LETTER MARKS
A AND B VERSUS C, D, AND F, IN FRESHMEN ENGLISH FROM AN EXAMINATION SCORE

[Column headings partly illegible; the surviving headings read "Per cent" and "Per cent in group."]

Score group      Prediction      Surviving entries
Above 130        A or B          16.6
130 or below     C, D, or F      83
Total                            84.0


There are several ways of interpreting this table. We can note that there 
were 132 errors of prediction. If we are interested in predicting marks from 
scores, with the division point adopted we should wrongly elect 90 to receive 
marks of A or B and we should wrongly designate 42 to receive marks of C, D, 
or F. In predicting the 185 who according to high scores should receive A or
B we should be correct in 51.4 per cent of the cases. This does not seem very
high accuracy, unless we compare it with the proportion of those with A and 
B marks in the entire group, which is 137/826, or about 16.6 per cent. In 
predicting the 641 to receive C or below, the accuracy of 93.4 per cent seems very high
until we realize that about 84 per cent of the entire sample received similar 
marks. In comparing the percentages of correct predictions with the per- 
centages of corresponding types of cases in the entire sample, we are going in 
the direction of the chi-square test, in which divergency of distribution in the 
row or columns from the distribution in the marginal frequencies is the indi- 
cation of departure from a random situation. A more interpretable index of 
the degree of divergence is the phi coefficient. In this problem, chi square is 
208.11, which is far above required significance levels. From this we find 
φ to be .50, which indicates the amount of correlation between marks and
examination scores when both are dichotomized and used in that manner for
prediction purposes.
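
The chi square and φ just quoted can be approximated from the four cell counts implied by the discussion: 185 and 641 cases in the two prediction groups, with 90 and 42 errors respectively. The sketch below uses the ordinary fourfold formula without continuity correction, which is our assumption about the computation; the function names are hypothetical.

```python
from math import sqrt


def chi_square_2x2(a, b, c, d):
    """Chi square for the fourfold table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def phi_from_chi_square(chi2, n):
    """Phi coefficient as the square root of chi square over N."""
    return sqrt(chi2 / n)


# Cell counts derived from the figures quoted in the text.
correct_high, wrong_high = 185 - 90, 90
wrong_low, correct_low = 42, 641 - 42

chi2 = chi_square_2x2(correct_high, wrong_high, wrong_low, correct_low)
phi = phi_from_chi_square(chi2, 826)
print(round(chi2, 2), round(phi, 2))
```

The chi square obtained this way is within a fraction of a unit of the 208.11 reported, and φ rounds to .50; the small difference may reflect rounding in the original work.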

We could test the accuracy of prediction in similar ways for each of the
other division points. The fourfold tables of frequencies would tell their own
stories, and φ would summarize the agreement between prediction and fact.
The φ might vary somewhat from one division to another. In a multiple-category
problem like this one, some might prefer to consider all five mark
categories together and note, for each division point, how many errors in pre- 
dicting marks are one-place errors, how many are two-place errors, and so on. 
A two-place error, for example, would be predicting a B when a D was 
obtained. A 5 × 5 contingency table might be set up with the four critical
scores as the division points between categories in variable X. In so far as
the widths of categories on the score scale differ, a contingency coefficient, C,
would be the summarizing index of correlation to use. 

The kind of study of errors of prediction will depend upon what information 
the investigator hopes to gain from the results. Whenever a procedure 
depending upon the counting of cases is used, it should be emphasized that 
rather large samples are needed for dependable comparisons. 

Locating a Critical Point in Predicting a Genuine Dichotomy. When the 
dichotomy is genuine, the graphic methods that were previously described 
apply. The division is at the point of equal likelihood, and the graphic 
methods satisfy that principle for the sample. Assuming that the sample is 
representative of the population, approximately the same division point 
should be effective in making predictions in the population. 

An example of data that may be treated as a genuine dichotomy is given in 
Table 14.10.¹ The two categories are “alcoholics” and “nonalcoholics”
defined in the clinical sense. The alcoholics were recognized by responsible 
agencies as problem drinkers. It can be argued that there is a continuum of 
degrees of tendency toward alcoholism, but clinically and administratively 
there is a rather definite categorization which divides the two. When in 
doubt about continuity it is best to treat a dichotomy as being real. 

Inspection of the distributions in the table shows that the possibilities for 
prediction are quite promising. The first graphic method, based upon over- 
lapping of the two frequency-distribution curves, with or without smoothing, 
gives a division point between scores 18 and 19. For any score of 19 and 
above we should expect to find more than half of the individuals in this sample 
alcoholic and for a score of 18 and below less than half alcoholic. The second 
graphic method gives the same result as the first. 

Before accepting this solution as the one we want, however, it is necessary 
to consider a new aspect to the prediction problem when we are dealing with 
qualitative categories. Second thought about the alcoholism data will sug- 
gest the idea that the distributions as given represent the general population 

¹ These data were adapted from a doctoral dissertation by M. P. Manson. A psychoneurotic
differentiation between alcoholics and non-alcoholics. Quart. J. Stud. Alcohol,
1948, 9, 175-206.

of men very poorly. In the general population, the proportion of alcoholics is
extremely small; certainly not 60 per cent, as the data in question show. The 
data were obviously not selected on the basis of stratification. In fact, for 
the purpose of the investigation, contrasting groups of about equal size were 
desired. Suppose that we had alcoholics represented in line with their pro- 
portion in the general population. When we came to apply the first graphic 
method, with relatively much smaller frequencies in that group, the intersec- 
tion of the curve with that for the nonalcoholic group would have been at a 
much higher score, if indeed it intersected at all. By the second graphic 
method, the proportions of alcoholics might have been less than .5 at all score 
levels. No solution by the principle of equal likelihood would then have been
possible. Another type of solution is therefore called for, one less dependent 
upon the proportions of the two kinds of individuals in the general population, 
if the principle of equal likelihood is to be applied. 


TABLE 14.10. DISTRIBUTION OF ALCOHOLICS AND NONALCOHOLICS FOR SCORES ON AN
ADJUSTMENT INVENTORY

               (1)        (2)       (3)      (4)         (5)        (6)       (7)      (8)
Scores              Frequencies              Propor-     Percentage distributions      Propor-
in the         Non-       Alco-              tion        Non-       Alco-              tion
inventory      alcoholic  holic     Both     alcoholic   alcoholic  holic     Both     alcoholic
66-71             0          1        1      (1.00)        0         0.5       0.5     (1.00)
60-65             0          6        6      (1.00)        0         3.0       3.0     (1.00)
54-59             1         13       14       .93          0.7       6.4       7.1      .90
48-53             1         13       14       .93          0.7       6.4       7.1      .90
42-47             3         17       20       .85          2.2       8.4      10.6      .79
36-41             3         33       36       .92          2.2      16.3      18.5      .88
30-35             2         32       34       .94          1.4      15.8      17.2      .92
24-29             9         32       41       .78          6.6      15.8      22.4      .705
18-23            16         23       39       .59         11.7      11.4      23.1      .49
12-17            36         24       60       .40         26.3      11.9      38.2      .31
 6-11            43          7       50       .14         31.4       3.5      34.9      .10
 0- 5            23          1       24       .04         16.8       0.5      17.3      .03
N               137        202      339       .596       100.0      99.9     199.9
M                14.11      32.83    25.27                14.08     32.80     23.44
σ                10.41      13.93    15.61                                    15.45


Assuming that we have qualitative categories, and that we are attempting 
to predict one quality or another, it would seem logical to treat the two as 
being of equal importance. In the data of Table 14.10 we may regard the 
mean of 14.11 as being characteristic of nonalcoholics as a species, also the 
form of distribution they gave. This is true if there was no biasing of sampling
within this group as such. Likewise, we may regard the distribution of
scores for alcoholics as characteristic of their population. This suggests a 
solution which would allow the two species equal representation. To 
achieve equal representation we may convert the obtained frequencies into 
percentage frequencies. These appear in columns 5 and 6 of Table 14.10. 
Beside them, in column 7, are given the sums of percentage frequencies in the
different class intervals, and in column 8 are given the proportion of alcoholics
at each score level. The graphic solution based upon these is shown in
Fig. 14.4, which yields a critical division point between scores 20 and 21.
Following this approach we may say that with scores 21 and above the odds
are greater than .5 that the individuals have the property of alcoholism and
with scores of 20 and below the odds are less than .5 for this property. We
shall consider later how many and what kind of errors this division point
would entail.

[Figure 14.4: proportion of alcoholics plotted against score on an adjustment inventory (horizontal axis, 0 to 65).]
Fig. 14.4. Proportion of alcoholics at each score level on an adjustment inventory. The
problem is to find that score point above which more than half have the property alcoholic.
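
The equalizing step behind columns 5 to 8 of Table 14.10 may be sketched as below. The variable names are ours, and the final straight-line interpolation between the midpoints of the 18-23 and 24-29 intervals is a simplification standing in for the graphic reading of Fig. 14.4.

```python
# Frequencies from Table 14.10, highest score interval first.
nonalcoholic = [0, 0, 1, 1, 3, 3, 2, 9, 16, 36, 43, 23]
alcoholic = [1, 6, 13, 13, 17, 33, 32, 32, 23, 24, 7, 1]


def percentage_distribution(freqs):
    """Convert frequencies to percentages of the group total."""
    total = sum(freqs)
    return [100.0 * f / total for f in freqs]


pct_non = percentage_distribution(nonalcoholic)
pct_alc = percentage_distribution(alcoholic)

# Proportion alcoholic in each interval when both groups carry equal weight.
equalized = [a / (a + n) for a, n in zip(pct_alc, pct_non)]

# Interpolate between the 18-23 interval (index 8) and the 24-29 interval (index 7).
lower_mid, upper_mid = 20.5, 26.5
p_low, p_high = equalized[8], equalized[7]
critical = lower_mid + (0.5 - p_low) * (upper_mid - lower_mid) / (p_high - p_low)
print(round(critical, 1))
```

Because each group is reduced to a percentage distribution before the proportions are taken, the result no longer depends on how many alcoholics and nonalcoholics happened to be sampled.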

When the two category groups are equated for size, as in the method just 
described, a much simpler solution is possible in certain situations. If the 
two distributions on the continuous variable are both symmetrical and of the 
same dispersion, the critical point will be at the unweighted mean of the two 
category means (M_p and M_q). This would be true, also, if with equal dispersions
any positive skewness in the one distribution is compensated for by
a like degree of negative skewness in the other. If all one wants is a division 
score and if these conditions are satisfied, the mean of the two means equally 
weighted will serve. For the data on alcoholism, the mean of the two means 
is 23.44. This is somewhat higher than the critical point determined by the 
graphic method, because the two distributions differ markedly in dispersion
and in skewness. 

Computation of a Critical Value Dividing Genuine Dichotomies. Without 
assuming any particular form of distribution for the continuous variable 
except that it be continuous, a critical value that will approximately satisfy 
the principle of equal likelihood may be estimated by the formula¹


\[
X_c = M_x + \left(\frac{.5 - p}{pq}\right)\left(\frac{\sigma_x^2}{M_p - M_q}\right) \tag{14.3}
\]
(Critical value on X dividing cases into most probable categories)

where M_x = mean of all X values
      p = proportion of the cases in the category having the higher mean
          of X values
      q = 1 − p
      M_p = mean of X values for category higher on X
      M_q = mean of X values for category lower on X
      σ_x² = variance in the total distribution on X
Let us apply this formula to the prediction of sex membership of high-school
students from knowledge of hand-grip scores. For a sample of 171 boys and
246 girls, the two means (M_p and M_q) were 37.35 and 20.68, respectively.
The mean of all cases combined was 27.51. The variance of the combined
group was 115.38. The proportions (p and q) were .410 and .590. Applying
formula (14.3),

\[
X_c = 27.51 + \left(\frac{.5 - .410}{(.410)(.590)}\right)\left(\frac{115.38}{37.35 - 20.68}\right) = 30.09
\]


This result tells us that students earning a score of 31 or above are more 
likely to be boys than girls; those with scores of 30 or below are more likely to 
be girls. 

An alternative formula requires less information. It reads

\[
X_c = M_x + \left(\frac{.5 - p}{p}\right)\left(\frac{\sigma_x^2}{M_p - M_x}\right) \tag{14.4}
\]
[Alternate to (14.3)]


where the symbols are as defined previously. While this formula is more 
convenient in computing, formula (14.3) is somewhat more meaningful.
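
A form that needs only p, M_p, M_x, and the variance follows from (14.3) by elementary algebra, since the general mean is the weighted mean of the two category means. Whether this form matches the printed alternative symbol for symbol is an assumption on our part; the algebra itself is as follows.

```latex
\begin{aligned}
M_x &= pM_p + qM_q
  \;\Longrightarrow\; M_p - M_x = q\,(M_p - M_q), \\[4pt]
X_c &= M_x + \frac{.5 - p}{pq}\cdot\frac{\sigma_x^2}{M_p - M_q}
     = M_x + \frac{.5 - p}{p}\cdot\frac{\sigma_x^2}{M_p - M_x}.
\end{aligned}
```

For the hand-grip data this gives 27.51 + (.09/.410)(115.38/9.84), or about 30.08, which agrees with the result from (14.3) within the rounding of the general mean.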

It will pay to examine (14.3) to see what may be expected as p varies and as
M_p − M_q varies. First, note that the critical score is the mean of all the X
values plus an increment. This increment is positive, and X_c will be above
the general mean, when p is less than .5. It will be negative, and X_c will be
below the general mean, when p is greater than .5. The division of cases in
making predictions is in the same direction as that in the population. When
p = .5, the increment becomes zero and the critical value equals M_x. This
fact is true regardless of the amount of correlation existing between X and
the categories. When p deviates very far from .5, the ratio becomes quite
large and likewise the increment. The critical value may even go outside the
distribution, which would mean that we would predict all cases to be within
the category having the greater frequency. If 90 per cent of a population,
let us say, are in the upper category, X_c might go very low on the scale. If
we predicted all, or nearly all, the cases to be in the upper category, we should,
of course, make a very small number of errors.

¹ From Guilford and Michael, op. cit.

It is of interest to consider the relation of the increment to the amount of 
correlation between X and Y. The type of correlation appropriate here is 
the point biserial. The point-biserial r is proportional to M_p − M_q and
inversely proportional to σ_x. This being true, it appears that the increment
is inversely proportional to the amount of correlation. The higher the correlation,
the nearer X_c is to the general mean, M_x. When the correlation is
perfect, predictions should ordinarily be perfect. For predictions to be perfect,
the position of X_c should be such that the proportion expected in the
upper category coincides with p, the obtained proportion. As the correlation
approaches zero, the critical value departs more and more from M_x and
assures the prediction of more and more cases in the more populous category.
As r_pbi becomes zero, if p does not equal .5, the increment becomes very large
and most predictions fall in the more populous group, if not all. Thus, the 
prediction is determined relatively more by knowledge of X when the correla- 
tion is large and by the knowledge of which category is more populous rela- 
tively more when the correlation is small, as we should expect. 
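
The inverse relation between the increment and the correlation can be made explicit. Taking the usual expression for the point-biserial coefficient and the form of (14.3) assumed in this section (an increment of (.5 − p)/pq times the variance over the difference of category means), the increment may be rewritten as follows.

```latex
\begin{aligned}
r_{pbi} &= \frac{M_p - M_q}{\sigma_x}\sqrt{pq}
  \;\Longrightarrow\;
  \frac{\sigma_x^2}{M_p - M_q} = \frac{\sigma_x\sqrt{pq}}{r_{pbi}}, \\[4pt]
X_c - M_x &= \frac{.5 - p}{pq}\cdot\frac{\sigma_x\sqrt{pq}}{r_{pbi}}
           = \frac{(.5 - p)\,\sigma_x}{\sqrt{pq}\;r_{pbi}}.
\end{aligned}
```

For fixed p and σ_x the increment therefore shrinks as the correlation grows and increases without limit as the correlation approaches zero, unless p equals .5.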

When Population Proportions Differ from Sample Proportions. Formulas 
(14.3) and (14.4) presuppose that the sample proportion is a good estimate 
of the population proportion. Application of the principle of equal likeli- 
hood depends upon this. In the case of the prediction of alcoholism from 
inventory scores, however, we know the population proportion of alcoholics 
is very far from the .596 that prevailed in the sample. In the general non- 
hospitalized population, the proportion might be less than 1 per cent. In 
a prison population or a hospital population, it would undoubtedly be 
greater than 1 per cent. In a psychopathic ward it would probably be 
even greater. How, then, should we apply the formulas? Shall we want 
to observe the principle of equal likelihood under all situations? We saw 
some doubt cast on its application earlier. Let us apply formula (14.4) 
to the data on alcoholism, assuming different population proportions for 
alcoholic addiction; proportions of .333 (one-third), .2, .1, and .01, as well 
as the .596 of the Manson study and the special case of p = .5. We do 
not have data derived from such populations, but if we assume that the 
means and standard deviations already found for the two categories of 
persons hold for the general situation, we can estimate M_x and σ_x for populations
made up of the specified proportions. The data are given in Table 14.11.

TABLE 14.11. ESTIMATION OF CRITICAL DIVISION SCORES FOR PREDICTING ALCOHOLISM
AS POPULATION PROPORTIONS OF ALCOHOLICS ARE ALLOWED TO VARY

For the obtained proportion of .596 for alcoholics, the X_c which would
give the maximal number of correct classifications is 20.08. For an assumed
proportion of .50, X_c is 23.47, which is equal to M_x when the two classes
are equal in size. This differs from the value estimated by the graphic
method in Fig. 14.4, which was approximately 20.3. The two may be
expected to coincide, as was suggested previously, when the two distributions
have equal dispersions and skewness. They do not satisfy this condition
here. If alcoholics made up a third of the population in which predictions
are made, the X_c should be at 28.95. If they made up only 1 per cent of
the population, it would take a critical score of 312 to find the two kinds
of individuals equally represented. This is, of course, well outside the
practical range of scores.
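
The critical scores quoted above can be approximated by a short computation. The sketch assumes, as the text does, that the category means and standard deviations of Table 14.10 hold in every population considered; it also assumes the form of (14.3) used earlier in this section, and the function names are our own.

```python
def mixture_moments(m_p, sd_p, m_q, sd_q, p):
    """Mean and variance of a population mixing two groups in proportions p and q."""
    q = 1.0 - p
    mean = p * m_p + q * m_q
    variance = p * sd_p ** 2 + q * sd_q ** 2 + p * q * (m_p - m_q) ** 2
    return mean, variance


def critical_value(m_p, sd_p, m_q, sd_q, p):
    """Assumed form of (14.3) applied to the mixed population."""
    q = 1.0 - p
    m_x, var_x = mixture_moments(m_p, sd_p, m_q, sd_q, p)
    return m_x + ((0.5 - p) / (p * q)) * var_x / (m_p - m_q)


# Group statistics from Table 14.10: alcoholics have the higher mean.
ALC_MEAN, ALC_SD = 32.83, 13.93
NON_MEAN, NON_SD = 14.11, 10.41

for p in (0.596, 0.5, 1 / 3, 0.2, 0.1, 0.01):
    x_c = critical_value(ALC_MEAN, ALC_SD, NON_MEAN, NON_SD, p)
    print(f"p = {p:.3f}   X_c = {x_c:.2f}")
```

The values for p of .596, .5, one-third, and .01 come out close to the 20.08, 23.47, 28.95, and 312 cited in the text, with small differences attributable to rounding.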

It is true that, for the same critical score (23, for example), the numbers and
percentages of mistakes (of the kind diagnosing nonalcoholics as alcoholics)
grow as the proportion of nonalcoholics increases. To reduce the number of
mistakes one would move X_c upward, as the results in Table 14.11 demonstrate.
For practical use of the predictive instrument, however, one would have to
desert the principle of
equal likelihood. Decisions then should be made taking into consider- 
ation the relative seriousness of the two kinds of errors. The principle 
of equal likelihood carries the implicit assumption that the two kinds of 
error are of equal importance. 

Effectiveness of Predictions in Genuine Dichotomies. The goodness of prediction
of the type being discussed here can be evaluated in much the same
manner as for the prediction of artificial categories. This is true, particu- 
larly, when there are stable and meaningful population proportions in the two 
categories. In view of the several qualifications mentioned above, however, 
the kind of evaluation will have to be adapted to fit the situation and to give 
the most meaningful and pertinent conclusion. The point-biserial r is a
general index of correlation that applies here. It will not give the kind of 
answer often desired in this connection. With a given critical value chosen 
for X, we have a fourfold contingency table, to which other tests, as described 
before, apply. 

Exercises 


1. Using Data 14A, make predictions in both directions. Determine the percentages
of correct predictions with and without knowledge of categories and the percentage of
forecasting efficiency. Discuss the results, including the usefulness of the predictions.


DATA 14A. RELATIONSHIP BETWEEN FAILING IN COLLEGE AND BEING ABOVE OR
BELOW THE MEDIAN IN HIGH-SCHOOL GRADUATING CLASS

Status in high-school class     Failing in one     No failures in     Total
                                or more courses    first semester
Above the median                      37                340            377
Below the median                      49                 71            120
Total                                 86                411            497


2. Using Data 14B, make predictions of whether a student will report “Yes,” “?,” or
“No” to the question about talking when he makes similar responses to the question about
walking in his sleep. What are the percentages of accuracy in these various predictions
and in the over-all set of predictions?


DATA 14B. RELATIONSHIP BETWEEN WALKING IN ONE'S SLEEP AND TALKING IN
ONE'S SLEEP AS REPORTED BY 1,787 STUDENTS*

                                    Walk in your sleep?
Talk in your sleep?        Yes            ?          No        Total
Yes                    [illegible]        8         400          497
?                      [illegible]       14         194          211
No                     [illegible]        3       1,060        1,079
Total                  [illegible]       26       1,663        1,787


* Jenness, A. F., and Jorgensen, A. P. Ratings of vividness of imagery in the waking state compared 
with reports of somnambulism, Amer. J. Psychol., 1941, 54, 253-259. Reproduced with the permission 
of the editor of Amer. J. Psychol. 


3. Apply the cell-square-contingency test to Data 14B, testing predictions from different
sources. Make any combinations of categories that seem necessary. Compute chi square
for the entire table. Draw conclusions. 

4. Find a critical total score which will subdivide the total group in Table 13.4 into the 
most probable categories (passing and failing). Use two graphic methods and a solution by 
formula. Discuss any discrepancies that may occur. 

5. Find a critical division point between boys and girls for the data in Fig. 15.1, which 
will make the best prediction of sex membership from knowledge of weight. Use formulas 
(14.3) and (14.4). Also assume equal proportions of boys and girls. 

Answers


1. Per cent of correct predictions of failures: 90.2 and 59.2; 82.7 for total; no excess over 
prediction without knowledge of high-school status. Per cent of correct predictions of 
high-school status: 57.0 and 82.7; 78.3 for total; 75.9 per cent without knowledge of failure, 
or an excess of 3.2 per cent. 

2. Per cent of correct predictions of talking: 89.8, 53.8, and 64.3; 65.5 for total; without 
knowledge of walking, 60.4 per cent, or an excess of 8.5 per cent. 

3. Combining the “Yes” and “?” categories for walking, cell-square contingencies for 
columns are 169.85 and 12.67; for rows, 121.67, 0.42, and 60.43. Chi square is 182.52. 
All are significant at the .01 level except predictions from the “?” category for talking.
C = .304.

4. Critical-score estimates: 78.5, 80.5, and 79.1. 

5. Critical score (using obtained proportions): 63.7; (using equal proportions): 62.2.


CHAPTER 15 


PREDICTION OF MEASUREMENTS 


PREDICTING MEASUREMENTS FROM ATTRIBUTES 


The Principle of Least Squares. What would be the most accurate predic- 
tion of the weight of a sixteen-year-old youth? By “most accurate” we 
mean a weight that, if chosen to predict the weight of each sixteen-year-old 
selected at random from a certain population, would be closer to the facts in 
the long run than any other estimate would be. To state the matter in 
another way, we want a predicted weight that would give us the smallest 
average discrepancy from the actual weights. For every person, we should 
find the difference between his actual weight and our prediction in order to 
obtain the single discrepancy. 

Statisticians have good reason to deal here in terms of the squares of the 
discrepancies rather than in terms of the discrepancies themselves. They 
demand a predicted measurement from which the sum of the squared discrepancies
is a minimum. The prediction that will satisfy this requirement
has been proved to be the mean of the distribution. In choosing the mean as 
our prediction, we are following the principle of least squares. Whereas in 
predicting attributes we chose the mode of a distribution as the indicator that 
would give us the smallest percentage of error of placement of cases, in pre- 
dicting measurements, we choose the mean as the indicator, which gives us the 
smallest set of squared deviations from the predicted value. 
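
The least-squares property of the mean can be shown in one line of algebra. In the sketch below, c stands for any proposed prediction, M for the mean of the N measurements X_i, and the cross-product term vanishes because deviations from the mean sum to zero; the symbols c and X_i are our own notation.

```latex
\sum_{i=1}^{N} (X_i - c)^2
  = \sum_{i=1}^{N} (X_i - M)^2 + 2\,(M - c)\sum_{i=1}^{N} (X_i - M) + N\,(M - c)^2
  = \sum_{i=1}^{N} (X_i - M)^2 + N\,(M - c)^2 .
```

The second term is never negative and is zero only when c equals M, so no other single prediction yields a smaller sum of squared discrepancies.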

Predictions Apply to Selected Populations. In answering the question 
with which we started this discussion, the best prediction of the weight of a 
sixteen-year-old, any better knowledge being lacking, is the mean weight of 
the population of which he is a member. If we wanted this to cover all
sixteen-year-olds, we should see to it that our distribution from which we 
derive our mean is made up of a large sample in which both sexes, all races, 
and all socioeconomic and geographic groups are proportionately represented. 
We might, however, confine the question to sixteen-year-olds from the United 
States. We might further confine it to high-school youths in one city, or, 
even further, to one particular high school. Whatever our restriction in 
population, the predicted weight will apply only (except by chance) to that 
kind of population. In fact, strictly speaking, it will apply only to the
measured sample. Whenever we extend our predictions to samples beyond our
known population, we always do so at the risk of enlarging errors of prediction.

Errors of Prediction Measured by the Standard Deviation. In a certain 
high school in a certain American city, a random sample of 51 sixteen-year- 
olds had weights distributed as shown in Fig. 15.1. For the sake of an illus- 
tration, we shall adopt the sixteen-year-olds in this high school as our popula- 
tion. What we say concerning predictions within this group will hold by 
analogy to larger, more inclusive populations. The mean of the 51 students’ 
weights is 61.9 kg., and the standard deviation is 13.2. If now the 51
students were listed in alphabetical order and without seeing them we used
merely the knowledge of the mean, we should most nearly predict the actual
weights if we wrote after each student's name "61.9 kg." The odds are about
2 to 1, as the interpretation of σ goes, that our errors would be no greater than
13.2 kg. either way from the predicted weight. The σ of 13.2 kg. may therefore
be taken to measure our margin of error in predicting single cases within
the sample, when prediction is based only upon knowledge of the mean.
Any other prediction we might make for all the individuals would yield a
larger margin of error, according to the principle of least squares. We should
not be very proud of our accuracy of prediction in this instance, and for
practical purposes of making decisions for individuals where their weights are
important factors, we should be seriously in error in many cases. But we
could do less well in predicting the individuals' weights if we did not even
possess the knowledge of their mean. Even if we knew the mean of
sixteen-year-olds in general and used that as our predictive value, we should do worse
than we did, unless the mean of this small population coincides with that of
all sixteen-year-olds. In other words, by knowing one attribute of our
population (a group in one American high school) and the mean that goes with
that attribute, we reduce the error of prediction to some extent.

Fig. 15.1. Distributions of sixteen-year-old high-school boys and girls for weight in
kilograms. Each dot represents an individual. (Panels: Boys, Girls, Both.)

Predicting Weight from Knowledge of Sex. Of the 51 cases in the popula- 
tion of sixteen-year-olds, 24 were boys and 27 were girls. Will it help to pre- 
dict more accurately if we know each individual's sex? It should, since there 
is a sex difference in weights. Though many girls are heavier than many 
boys, the averages are distinctly apart—67.8 for the boys and 56.6 for the 
girls. Using the attribute of sex to contribute toward the prediction of indi- 
vidual cases and following the principle of least squares, for each boy who 
came along we should predict his weight to be 67.8 kg., and for each girl, the 
prediction would be 56.6 kg. 

How much will predictions now be improved? The margin of error of
predictions for boys is given by the σ of their distribution, which is 12.6 kg., and
the margin of error for the girls is given by a σ of 11.3. From this
information, we see that both boys' and girls' weights are more accurately predicted
than before (when the margin of error was 13.2) and that the girls' predicted
weights are more free from error than are the boys'.

As a matter of consistency with previous procedures, let us ask what the
percentage of reduction in error of prediction is. For the boys, the change of
.6 in the σ is 4.5 per cent, and for the girls, the change in σ is 1.9, or 14.4 per
cent.
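
The percentages just quoted follow from dividing each drop in σ by the original σ of 13.2. A minimal sketch of that arithmetic, using the three standard deviations given in the text (the function name `percent_reduction` is introduced here only for illustration):

```python
def percent_reduction(sigma_before, sigma_after):
    """Percentage by which the margin of error shrinks."""
    return 100.0 * (sigma_before - sigma_after) / sigma_before


sigma_all = 13.2    # all 51 cases, predicted from the overall mean
sigma_boys = 12.6   # boys, predicted from the boys' mean
sigma_girls = 11.3  # girls, predicted from the girls' mean

boys_gain = percent_reduction(sigma_all, sigma_boys)
girls_gain = percent_reduction(sigma_all, sigma_girls)

print(f"boys: {boys_gain:.1f} per cent, girls: {girls_gain:.1f} per cent")
```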

The Standard Error of Estimate. There is a way of summarizing the 
margin of error for all cases combined. This requires the computation of a 
standard error of estimate. It is a kind of summary of all the squared dis- 
crepancies of actual measurements from the predicted measurements. In 
terms of a formula, the standard error of estimate is 


$$\sigma_{yx} = \sqrt{\frac{\sum (Y - Y')^2}{N}} \qquad \text{(Standard error of estimate)} \tag{15.1}$$

where Y = measured value of a case we are trying to predict
      Y' = predicted value for the case
      N = total number of cases predicted

The subscript in $\sigma_{yx}$ tells us that we are predicting variable Y from variable X.
In the illustrative problem, Y is the variable of weight, and X is the variable
of sex difference. The sum of the discrepancies squared (see Table 15.1) is
7,288.1, and so

$$\sigma_{yx}^2 = \frac{7{,}288.1}{51} = 142.90$$

$$\sigma_{yx} = 11.9$$

The standard error of the estimate, in predicting weight on the basis of
knowledge of sex, is 11.9. Using only the knowledge that this is a particular group
of sixteen-year-olds with a mean of 61.9, the error of estimate was given by a 
standard deviation of 13.2. The margin of error using the information sup- 
plied by sex difference is 90.2 per cent as large as that without using this infor- 
mation. The reduction in size of error of prediction is 9.8 per cent, which is 
rather small but represents some gain. 

In computing the standard error of estimate in this kind of problem, it is
probably more natural to do so by finding the σ's of the two part distributions
separately and then combining them. They cannot be combined directly by
simple addition or averaging. It is the squared deviations in the two groups
that must be combined. The sum of the squared deviations in each distribution
can be found by the formula¹

$$\sum x_a^2 = N_a \sigma_a^2 \qquad \text{(Sum of squares of discrepancies within one distribution)} \tag{15.2}$$

where $\sum x_a^2$ = sum of the squared discrepancies between prediction and fact
      (or between measurements and the mean) in distribution A
      (one of the attribute distributions)
      $N_a$ = number of cases in distribution A
      $\sigma_a$ = standard deviation of distribution A

When these sums of squared deviations are obtained from all component
distributions (distributions A, B, C, etc.), they may be combined by simple
addition to give $\sum (Y - Y')^2$. In other words,

$$\sum (Y - Y')^2 = \sum N_i \sigma_i^2 \qquad \text{(Sum of squares of discrepancies in all distributions)} \tag{15.3}$$

where $N_i$ = number of cases in any component distribution (distributions
A, B, C, etc., in turn) and $\sigma_i$ = standard deviation of the same distribution.²

The work of computing $\sum (Y - Y')^2$ for the problem on weights of
sixteen-year-olds may be summarized as in Table 15.1. From here the computation
of $\sigma_{yx}$ is exactly the same as previously demonstrated.
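
The combination described by formulas (15.2) and (15.3) can be sketched in a few lines. The group sizes (24 boys, 27 girls) and the variances (160.02 and 127.69) are taken from the text and Table 15.1; the dictionary layout and variable names are assumptions made for this illustration.

```python
from math import sqrt

# (number of cases, variance) for each subgroup, as given in the text.
groups = {"boys": (24, 160.02), "girls": (27, 127.69)}

# Formula (15.2) for each group, then formula (15.3): add the group sums.
sum_sq = sum(n * variance for n, variance in groups.values())
n_total = sum(n for n, _ in groups.values())

# Formula (15.1): standard error of estimate for all cases combined.
se_estimate = sqrt(sum_sq / n_total)

print(f"sum of squares = {sum_sq:.2f}, N = {n_total}, sigma_yx = {se_estimate:.2f}")
```

Note that the two σ's are never averaged directly; only the weighted squared deviations are additive.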


TABLE 15.1. SUMMARY OF THE COMBINATIONS OF SUMS OF SQUARES FROM
DIFFERENT SUBSAMPLES

Distribution |  N | σ²     | Nσ²
Boys         | 24 | 160.02 | 3,840.48
Girls        | 27 | 127.69 | 3,447.63
Σ(Y - Y')²   |    |        | 7,288.11

¹ Cf. formula (5.8).
² It will be recognized that $\sum (Y - Y')^2$ is essentially a sum of squares from which the
within variance would be computed in analysis of variance (see Chap. 12).

Other Predictive Indices May Be Introduced. It should be added that
other attributes may be brought into the predictive picture. For instance, if 
different glandular constitution has a definite bearing on body weight, for 
example, thyroid functioning, we could subdivide each sex group into two or 
three categories as to glandular condition. The mean of each new subgroup 
would then become the prediction for members of that group. The devia- 
tions of actual weights from these means would be smaller, and the new 
standard error of estimate would be reduced in size. 

If we were successful in singling out all the significant factors correlated 
with weight and could predict from all of them at the same time, theoretically 
we could reduce errors of prediction to approximately zero. We can probably 
never know what all the significant factors are from which weight can be 
determined, and if we did it might be impossible to assign all the attributes to 
each individual. We are here speaking of the hypothetical limiting case. 
Any improvement in predictions approaches that limit. From a practical 
standpoint, it is always a question of whether the trouble of uncovering and 
using new descriptive attributes is justified by the gains in predictive accuracy 
that result.

Estimation of Errors of Prediction in the Population. The standard error
of estimate computed for the weight-prediction problem, strictly speaking, 
applies to the sample only. It is a biased estimate of the margin of error that 
would occur in making predictions beyond this particular sample but in the 
same population. To estimate the standard error of estimate for the popula- 
tion, we need, as usual, to consider degrees of freedom, unless the sample is 
large. The formula would be the same as (15.1) with the substitution of 
N - m for N, where m is the number of categories predicted from.

$$\sigma_{yx} = \sqrt{\frac{\sum (Y - Y')^2}{N - m}} \qquad \text{(Standard error of estimate corrected for bias)} \tag{15.4}$$

With this formula applied instead of formula (15.1), the corrected standard 
error of estimate is 12.2 rather than 11.9. The corrected one is the more 
realistic one to use in making predictions outside the sample. 
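
As a quick check on the correction, the sketch below applies formulas (15.1) and (15.4) to the weight problem, where the sum of squares is 7,288.11, N = 51, and m = 2 (the two sex categories). The function name and the `n_categories` parameter are introduced here for illustration only.

```python
from math import sqrt


def standard_error_of_estimate(sum_sq, n_cases, n_categories=0):
    """Formula (15.1) when n_categories is 0; formula (15.4) otherwise."""
    return sqrt(sum_sq / (n_cases - n_categories))


sum_sq = 7288.11  # sum of squared discrepancies from Table 15.1
uncorrected = standard_error_of_estimate(sum_sq, 51)
corrected = standard_error_of_estimate(sum_sq, 51, n_categories=2)

print(f"uncorrected = {uncorrected:.2f}, corrected = {corrected:.2f}")
```

The correction matters little with 51 cases and two categories, but it grows as the number of categories approaches the number of cases.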


PREDICTING MEASUREMENTS FROM OTHER MEASUREMENTS 


When both known and predicted variables are measured on linear scales 
and there is some relation between them so that predictions are possible, we 
have a much more complicated problem. A complete treatment of it involves
correlation methods, regression equations, and other procedures. 

The Correlation Diagram. Our illustration of this kind of problem con- 
sists of two achievement examinations in a course on educational measure- 
ments. In Table 15.2, we have the two distributions grouped in class inter- 
vals and the measurements in each class interval broken down to form a 
distribution of its own in the other test. The class intervals for test X are
listed along the top of Table 15.2 and the class intervals for test Y are listed 
along the left margin. 


TABLE 15.2. PREDICTING SCORES IN ONE TEST FROM KNOWN SCORES IN
ANOTHER TEST

                                      Test X
Test Y  | 60-64 | 65-69 | 70-74 | 75-79 | 80-84 | 85-89 | 90-94 | 95-99 | f_r | M_row | σ_row
135-139 |       |       |       |       |       |       |       |   1   |   1 | 97.0  | --*
130-134 |       |       |       |   1   |   1   |   0   |   1   |       |   3 | 83.7  | 6.61
125-129 |       |       |       |   1   |   0   |   2   |   1   |       |   4 | 85.8  | 5.45
120-124 |       |       |   1   |   4   |   4   |   6   |   2   |       |  17 | 83.2  | 5.67
115-119 |       |       |   7   |   5   |   7   |   2   |   1   |       |  22 | 78.6  | 5.72
110-114 |   1   |   4   |   2   |   9   |   4   |   2   |       |       |  22 | 75.9  | 6.56
105-109 |   1   |   1   |   2   |   5   |   1   |       |       |       |  10 | 74.0  | 5.56
100-104 |   1   |   3   |   0   |   1   |   1   |       |       |       |   6 | 70.3  | 6.87
 95- 99 |       |   2   |       |       |       |       |       |       |   2 | 67.0  | 0.00
f_c     |   3   |  10   |  12   |  26   |  18   |  12   |   5   |   1   |  87 = N
M_c     | 107.0 | 105.5 | 114.9 | 114.5 | 116.4 | 120.3 | 124.0 | 137.0 |
σ_c     | 4.08  | 5.52  | 4.31  | 6.83  | 6.43  | 4.71  | 5.10  |  --*  |

* The standard deviation of this array is indeterminate.



Prediction of Y from X. As usual, we have here a double prediction prob- 
lem: the prediction of a score in Y from a known score in X, and vice versa. 
Let us consider the prediction of Y from X first. For the individuals in any 
class interval in test X, the best prediction is the mean of the Y distribution 
in that column, in other words, the mean of the column ($M_c$). For each
column of Table 15.2, its mean is listed in the next to last row. For the first
column, $M_c$ is 107.0. Any person receiving a score from 60 to 64 inclusive in
test X will most probably earn a score of 107.0 in test Y. The other means of
the columns are similarly interpreted. It will be noticed that there is a
general upward trend in the $M_c$'s as we go up the scale in test X, though there
are two inversions. In view of the small numbers of cases upon which these 
means are based, some inversions are not surprising. 

The margin of error in predicting Y from X in each column is indicated by 
the standard deviation of that column. The $\sigma_c$'s are listed in the last row of
Table 15.2. They remain fairly constant, but the range is from 4.08 to 6.83.
The significance of the variations in $\sigma_c$ could be examined by making F tests
(see Chap. 10).

The entire picture of predictions and their margins of errors within columns 
is shown graphically in Fig. 15.2. The circlets show the positions of the 
column means, and the vertical lines running through them extend from $-1\sigma_c$
to $+1\sigma_c$. In each column, we expect two-thirds of the observed scores to lie
within the limits of these lines. 

Standard Error of Estimate. In order to obtain a single indicator of the
goodness of the prediction of Y scores from X scores, we may compute a
standard error of estimate as we did before when predicting measurements
from attributes. The work is best organized as in Table 15.3. For every
column, we list first $N_c$, the number of cases in that column. Second, we list
$\sigma_c^2$, the squared σ of the distribution in that column. Next we find the
product of these two values for that column. The sum of these products for
all columns yields $\sum (Y - Y')^2$, which we need for computing $\sigma_{yx}$. This sum
is 2,930.97. From here on the work follows formula (15.1).

$$\sigma_{yx}^2 = \frac{2{,}930.97}{87} = 33.6893$$

$$\sigma_{yx} = 5.80$$

The σ of the entire distribution of Y scores is 7.85, so that there is a
reduction in variability of 2.05, or 26.1 per cent, a marked improvement in
prediction, as such tests go. We may say that the forecasting efficiency for
predicting Y scores from X scores as we did is approximately 26 per cent.

Fig. 15.2. A chart showing the most probable score in test Y corresponding to
each midpoint score in test X, also the range between minus and plus one
standard deviation within each column. (Horizontal axis: scores in test X.)


TABLE 15.3. COMPUTATIONS OF THE STANDARD ERROR OF ESTIMATE OF Y SCORES FROM
X SCORES

N_c | σ_c²  | N_c σ_c²
  3 | 16.67 |    50.01
 10 | 30.45 |   304.50
 12 | 18.58 |   222.96
 26 | 46.63 | 1,212.38
 18 | 41.36 |   744.48
 12 | 22.22 |   266.64
  5 | 26.00 |   130.00
Σ(Y - Y')²  | 2,930.97
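
The work of Table 15.3 can be reproduced with a short script. The column counts and variances are those printed in the table, and N = 87 and σ_y = 7.85 come from the text; the variable names are assumptions for this sketch.

```python
from math import sqrt

# (N_c, variance) for the seven columns of Table 15.3.
columns = [
    (3, 16.67), (10, 30.45), (12, 18.58), (26, 46.63),
    (18, 41.36), (12, 22.22), (5, 26.00),
]
n_total = 87    # includes the single case in the 95-99 column, which adds no squared deviation
sigma_y = 7.85  # standard deviation of all Y scores

sum_sq = sum(n * variance for n, variance in columns)
se_estimate = sqrt(sum_sq / n_total)  # formula (15.1)
efficiency = 100.0 * (sigma_y - se_estimate) / sigma_y

print(f"sum = {sum_sq:.2f}, sigma_yx = {se_estimate:.2f}, efficiency = {efficiency:.1f} per cent")
```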


Predicting X from Y. The predictions of X from Y are listed in Table 15.2
under $M_{row}$ in the next to the last column. The most probable X score for
any interval of Y scores is the mean of the row. The margin of error of the
predictions is given in each case by $\sigma_{row}$, and these appear in the last column
of Table 15.2. To complete the picture of these predictions and their σ's,
Fig. 15.3 is presented. The standard error of estimate of the X scores, $\sigma_{xy}$
(note the order of x and y in the subscript), is equal to 5.93. Since the total
σ of the X scores is 7.60, the reduction in error of prediction is 1.67, which is
22.0 per cent. The forecasting efficiency in predicting X from Y is in
this problem somewhat lower than the forecasting efficiency (26.1 per cent) in
predicting Y from X.¹

Fig. 15.3. A chart showing the most probable score in test X for each
midpoint score in test Y, also the range between minus and plus one standard
deviation within each row. (Horizontal axis: scores in test X.)

The procedure for predictions by using means of columns and rows is not
used very much in practice. It was emphasized here because of the principles
it illustrates, principles that underlie the regression methods to be described next.
The reader will find that the main principle for making predictions of
measurements still holds: the principle of least squares. He will also find that
the principles for testing accuracy of prediction (the standard error of estimate
and the percentage of reduction of errors) also still apply. New ways of
estimating them will be shown and their relation to the coefficient of
correlation will be explained. In addition,
new ways of interpreting the usefulness of predictions will be demonstrated.


REGRESSION EQUATIONS 


The Meaning of a Regression Equation. The main use of a regression 
equation is to predict the most likely measurement in one variable from the 
known measurement in another. If the correlation between Y and X were 
perfect (with a coefficient of +1.00 or —1.00), we could make predictions of 
Y from X or of X from Y with maximum accuracy; the errors of prediction 
would be zero. If the correlation were zero, predictions would be futile. 
Between these two limits, predictions are possible with varying degrees of 
accuracy. The higher the correlation, the greater is the accuracy of predic- 
tion and the smaller the errors of prediction. 

When we use the means of columns of a scatter diagram as the most probable
corresponding Y values, we are actually predicting Y's only for the midpoints
of intervals on X, or, stated in another way, we are predicting the same
Y value for a certain range of values on X. If we have any desire to be more
accurate than that, we should like to be able to make predictions for all values
of X. This the regression line and the regression equation enable us to do.

¹ The σ's of the arrays were computed here without applying Sheppard's correction.
Had this correction been used, the σ's would have been smaller and consequently
$\sigma_{yx}$ and $\sigma_{xy}$ would have been smaller.

We found (see Figs. 15.2 and 15.3) that the means of the columns (and of
the rows) tended to lie along a straight line, with some minor deviations from
strict linearity. We shall now assume that the best predictions of Y from X
lie along a line that best fits the means of the columns when those means are
weighted according to the number of cases represented in each one. This is
known as the line of best fit, or the regression line. When predicting X from
Y, we have another such line for the regression of X on Y. The two regression
lines for the achievement-test data will be found pictured in Fig. 15.4. Only
when a correlation is perfect will the two lines coincide throughout their
lengths. The higher the correlation, plus or minus, the closer together they
tend to lie. All such pairs of regression lines intersect at the point
representing the means of Y and X; in this case, they cross at X = 78.15 and Y = 115.28.

Fig. 15.4. A scatter diagram for two examinations, with two regression lines represented
and their equations. (Horizontal axis: scores in test X.)

The Regression Equations and Regression Coefficients. From elementary 
algebra, the student should remember that the equation for a straight line, in 
general form, is Y = a + bX. Such an equation completely describes a line 
when a and b are known; they are the regression coefficients and must be 
obtained from the data we have. Leaving out of account for the moment the 
coefficient a, we should have Y = bX, or Y equals b times X. We see from
this that b is a ratio, and it tells us how many units Y is increasing for every
increase of one unit in X. If b were 2, then for every unit of increase in X, Y
increases two units. If b = 0.5, then for every unit increase in X, Y increases
a half unit. The b coefficient gives us the slope of the regression line, and it 
depends upon the coefficient of correlation and the two standard deviations, 
as in the formula

$$b_{yx} = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right) \qquad \text{(Coefficient for linear regression of Y on X)} \tag{15.5}$$

where $b_{yx}$, with the subscripts in that order, implies that we are predicting Y
from X, and where this is also true for $r_{yx}$.

When we want to predict X from Y, we have a different regression equation 
with a different b, which is given by the formula

$$b_{xy} = r_{xy}\left(\frac{\sigma_x}{\sigma_y}\right) \qquad \text{(Coefficient for linear regression of X on Y)} \tag{15.6}$$

The coefficient of correlation is, of course, numerically the same in both cases,
since $r_{yx} = r_{xy}$. But in each case, the b's are different and are equal to r times
the ratio of the standard deviation of the predicted variable to that of the
variable predicted from. We frequently speak of the predicted variable as
the dependent variable and of the one predicted from as the independent variable.
The reason for this is that, in predicting Y from X, we arbitrarily take
any value of X that we wish at the moment, whereas the Y we predict from it 
is dependent upon what X we have chosen. Once we have picked out a cer- 
tain X, Y is immediately fixed by our regression equation. 

The regression coefficient a is merely a constant that we must always add 
in order to assure that the mean of the predictions will equal the mean of the 
obtained values. As b determines the slope of the line, a determines the
general level of the line. It is given by the formulas

$$a_{yx} = M_y - (M_x)\,b_{yx} \tag{15.7a}$$

$$a_{xy} = M_x - (M_y)\,b_{xy} \tag{15.7b}$$

(The a coefficient in a linear regression equation)


where the first one concerns the equation for the regression of У on X and the 
second concerns the equation for the regression of X on Y. 

The derivation of the entire regression equation is more often accomplished 
by one composite formula, combining the derivations of a and b into one
operation as follows:

$$Y' = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y \tag{15.8a}$$

$$X' = r_{xy}\left(\frac{\sigma_x}{\sigma_y}\right)(Y - M_y) + M_x \tag{15.8b}$$

(Complete statement of linear regression equations)

¹ For a derivation of formulas for finding regression coefficients, see Appendix A.

We use Y' and X' here rather than Y and X to show that they are predicted
rather than obtained values. Predictions and obtained values rarely coincide 
unless correlations are nearly perfect. 

Applied to the data of Table 15.2, we have

$$
\begin{aligned}
Y' &= .61\left(\frac{7.85}{7.60}\right)(X - 78.15) + 115.28 \\
   &= (.61)(1.03)(X - 78.15) + 115.28 \\
   &= .630X - 49.23 + 115.28 \\
   &= .630X + 66.05 \\
X' &= .61\left(\frac{7.60}{7.85}\right)(Y - 115.28) + 78.15 \\
   &= .591Y + 10.02
\end{aligned}
$$

Interpreting these equations, we may say that Y' increases .630 unit for
every unit increase in X and that X' increases .591 unit for every unit increase
in Y. One way of checking the accuracy of the solution of regression
equations is to substitute $M_x$ in the first one to see whether Y' is the mean of the
Y's and to substitute $M_y$ in the second to see whether we obtain $M_x$ as our
prediction of X.

Another check as to the accuracy of computation of the b coefficients is the
equation

$$b_{yx}\,b_{xy} = r^2 \qquad \text{(Relation of regression coefficients to } r^2\text{)} \tag{15.9}$$

In other words, the product of the two b coefficients is equal to the square of
the coefficient of correlation. In this instance

$$(.630)(.591) = .3723 \approx .61^2$$
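
Both regression equations, and the two checks just described, can be reproduced from the five summary values given in the text (r = .61, the two means, and the two standard deviations). The helper `regression_line` and its argument names are assumptions made for this sketch.

```python
def regression_line(r, sd_predicted, sd_known, mean_predicted, mean_known):
    """Return (a, b) for predicting one variable from the other, per (15.5)-(15.7)."""
    b = r * sd_predicted / sd_known
    a = mean_predicted - mean_known * b
    return a, b


r = 0.61
mean_x, mean_y = 78.15, 115.28
sd_x, sd_y = 7.60, 7.85

a_yx, b_yx = regression_line(r, sd_y, sd_x, mean_y, mean_x)  # Y on X
a_xy, b_xy = regression_line(r, sd_x, sd_y, mean_x, mean_y)  # X on Y

print(f"Y' = {b_yx:.3f}X + {a_yx:.2f}")
print(f"X' = {b_xy:.3f}Y + {a_xy:.2f}")

# Check 1: each line passes through the point of the two means.
# Check 2: the product of the b coefficients equals r squared, formula (15.9).
print(abs(a_yx + b_yx * mean_x - mean_y) < 1e-9, abs(b_yx * b_xy - r * r) < 1e-12)
```

The intercepts printed here differ slightly in the second decimal from those in the text, because the text rounds each b to three places before computing a.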


The Concept of Regression. It may help in understanding the regression 
equations as given in formulas (15.8a) and (15.8b) to take a glance at their
origin. The idea of regression came first and the correlation method fol- 
lowed. It began with Sir Francis Galton, who was making some studies of 
heredity suggested by implications of the theories of evolution put forth by 
his even more illustrious cousin, Charles Darwin. 

When Galton studied the relation of heights of offspring to the heights of 
their parents, he began by preparing a scatter diagram, perhaps the first. In 
order to put parents and their children on a common measuring scale, he 
converted all heights to standard scores. As the reader already knows, this 
meant expressing each person’s height as a ratio of his deviation from his 
group mean to the standard deviation of that group dispersion. The unit 
for the offspring's scale and also for the parents' scale was then 1σ. Figure
15.5 shows the type of figure Galton drew. 

Galton next computed the means of offspring’s heights (in z scores) corre- 
sponding to certain fixed parents’ heights (in z scores). As we saw in the 
example earlier in this chapter when the same operations were performed (but
with raw scores), he found that the means of columns fell along a straight-line 
trend. To him, incidentally, one striking phenomenon was that the means 
of offspring's heights did not increase as rapidly as did the parents' heights. 
Each mean height of offspring deviated less from their general mean than the 
height of the parents from which they came deviated from their mean. This 
“falling back” of heights of offspring toward the general mean has been called 
the law of filial regression. It is merely the phenomenon of imperfect
correlation. Had the correlation between children and parents in height been
perfect, the regression would have been as shown by the dotted line in Fig. 15.5.
The correlation was actually about +.50, and the obtained regression line
was as shown.

Fig. 15.5. Diagram showing the relation of the Pearson product-moment coefficient of
correlation to the slope of the regression line when scores in both X and Y are in
standard-score units. (Horizontal axis: height of parent, from -3σ to +3σ; vertical
axis: height of offspring.)

Origin of the Coefficient of Correlation. Galton wanted a single value 
which would express the amount of this regression phenomenon in any par- 
ticular relationship problem. Karl Pearson solved the problem in terms of 
the formula to which his name is attached. The steps were somewhat as 
follows. Galton’s own idea was to use the slope of the regression line as the 
index of relationship, because the steeper the slope, the closer the agreement 
between two variables. The slope of the regression line in Fig. 15.5, as in any 
coordinate plot, is the ratio of the increase in Y corresponding to a certain 
increase in X. From the plot we see that as X changes 2σ (from the mean to
+2σ, as shown), Y changes only 1σ. The slope is 1/2, or .5. This was Galton's
coefficient of regression, which received the symbol r for that reason.
That symbol has remained. The Pearson r is the slope of the regression line
when both Y and X are measured in standard-deviation units. In this case,
it can be shown that

$$r_{yx} = \frac{\sum z_y z_x}{N} \qquad \text{(Pearson } r \text{ from standard measures)} \tag{15.10}$$


In other words, r is an average of all the cross products of standard measures. 
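
Formula (15.10) translates directly into code. The sketch below uses five hypothetical pairs of scores (not data from the text) and the population standard deviation, which matches the N divisor in the formula; the function name `pearson_r` is introduced here for illustration.

```python
from statistics import mean, pstdev


def pearson_r(xs, ys):
    """Mean cross product of standard scores, as in formula (15.10)."""
    mean_x, mean_y = mean(xs), mean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # population sigma (divisor N)
    z_x = [(x - mean_x) / sd_x for x in xs]
    z_y = [(y - mean_y) / sd_y for y in ys]
    return sum(zx * zy for zx, zy in zip(z_x, z_y)) / len(xs)


# Hypothetical paired scores, chosen only to keep the arithmetic easy to follow.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]
r = pearson_r(xs, ys)

print(f"r = {r:.2f}")
```

For these values the cross products of deviations sum to 8 and each sum of squared deviations is 10, so r comes to .80.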
Derivation of the Regression Equations. Since r is the slope of the regression
line when standard measures are used, the equation for this situation is

$$z'_y = r_{yx}\,z_x \qquad \text{(Regression equation with standard measures)} \tag{15.11}$$

Here we use $z'_y$ with the prime to denote a predicted value as distinguished
from the actual value. From this beginning, let us work toward the regression
equations in raw-score form [formulas (15.8a) and (15.8b)]. The next
step is to express these standard measures as deviations, y' and x. Since
$z_x = x/\sigma_x$ and $z'_y = y'/\sigma_y$ ($\sigma_y$ is the unit of the $z'_y$ values as well as of the $z_y$
values), the equation becomes

$$\frac{y'}{\sigma_y} = r_{yx}\,\frac{x}{\sigma_x} \tag{15.12}$$

If we multiply this equation through by $\sigma_y$, we have

$$y' = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)x \qquad \text{(Regression equation with deviation scores)} \tag{15.13a}$$

or

$$y' = b_{yx}\,x \tag{15.13b}$$

Equation (15.13b) shows that the same b coefficient applies to deviation scores
as that applying to raw scores [see formula (15.8a)]. It also shows that since
the means of x and y are zero, the regression lines will pass through both of 
them without having an a coefficient in the equation. 

One more step is needed to arrive at the raw-score type of regression
equations. Going back to equation (15.12), if we next convert x to its equivalent,
$X - M_x$, and y' to its equivalent, $Y' - M_y$ ($M_y$ is the mean of the Y' values
as well as of the Y values), we have

$$\frac{Y' - M_y}{\sigma_y} = r_{yx}\,\frac{X - M_x}{\sigma_x} \tag{15.14}$$

Multiplying through by $\sigma_y$, we have

$$Y' - M_y = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x)$$

And transposing $M_y$,

$$Y' = r_{yx}\left(\frac{\sigma_y}{\sigma_x}\right)(X - M_x) + M_y$$

which is identical with formula (15.8a).

Regression Coefficients from Ungrouped Data. When data have not been 
grouped in class intervals, the derivation of the b coefficient requires another 
formula, which reads

$$b_{yx} = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \tag{15.15}$$

When this formula is applied to the data in Table 8.3, we have

$$b_{yx} = \frac{4{,}720 - 4{,}550}{6{,}240 - 4{,}900} = .127$$

The a coefficient is obtained by means of formula (15.7a) and is solved as
follows:

$$a_{yx} = 6.5 - (7.0)(.127) = 5.61$$

The regression equation is therefore Y' = 5.61 + .127X. The equation for
the regression of X on Y can be obtained by similar operations, substituting
Y for X, and vice versa, in formula (15.15). The solution for the illustrative
problem is

$$b_{xy} = \frac{4{,}720 - 4{,}550}{5{,}330 - 4{,}225} = .154$$

and

$$a_{xy} = 7.0 - (6.5)(.154) = 6.0$$

Checking the b coefficients, $b_{yx} b_{xy} = (.127)(.154) = .0196 = r^2$, which is in
agreement with $r^2$ as previously known (see Table 8
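
The raw-sum computation of formula (15.15) can be sketched as follows. The individual sums below are not printed in this passage: they are inferred from the products shown in the worked example (4,720, 4,550, 6,240, 4,900) together with the means 7.0 and 6.5, which imply N = 10. Treat them, and the function name `slope_from_sums`, as assumptions of this illustration.

```python
def slope_from_sums(n, sum_xy, sum_x, sum_y, sum_x2):
    """Regression coefficient b_yx from raw sums, formula (15.15)."""
    return (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)


# Inferred sums: N*sum_xy = 4,720; sum_x*sum_y = 4,550; N*sum_x2 = 6,240; sum_x**2 = 4,900.
n, sum_xy, sum_x, sum_y, sum_x2 = 10, 472, 70, 65, 624

b_yx = slope_from_sums(n, sum_xy, sum_x, sum_y, sum_x2)
a_yx = sum_y / n - (sum_x / n) * b_yx  # formula (15.7a)

print(f"Y' = {a_yx:.2f} + {b_yx:.3f}X")
```

The advantage of this form is that no means, deviations, or correlation coefficient need be computed first; everything comes from running totals of the raw scores.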

Predictions from Regression Equations. As an illustration of how a 
regression equation is applied in prediction, let us assume some values of X 
and find the corresponding Y’ values. Because in the preceding methods of 
prediction we predicted Y’s corresponding to midpoints of the intervals of X, 
let us do the same here for the sake of comparison, remembering that we 
might have chosen any values of X that we pleased. Table 15.4 gives the X 
values and their corresponding Y' values. When X is 62, Y' is 105.1, and 
when X = 97, Y' = 127.2, etc. It is interesting to compare these particular 
predictions with the means of the columns, which are given in the third row 
of Table 15.4. The discrepancies will be found very small as a rule. Grant- 
ing that the column means are generally not very reliable because of small 
samples, we may feel more assurance in the Y' predictions because they are
determined from the trend of the entire data rather than by small samples in
separate columns. The predictions of X' from Y are given in the second section
of Table 15.4 and are compared with the means of the rows as a matter of
interest.

TABLE 15.4. PREDICTIONS OF Y FROM X AND X FROM Y BY MEANS OF
REGRESSION EQUATIONS*

Y' = 0.630X + 66.05

X     |    62 |    67 |    72 |    77 |    82 |    87 |    92 |    97
Y'    | 105.1 | 108.3 | 111.4 | 114.6 | 117.7 | 120.9 | 124.0 | 127.2
M_c   | 107.0 | 105.5 | 114.9 | 114.5 | 116.4 | 120.3 | 124.0 | 137.0

X' = 0.591Y + 10.02

Y     |   97 |  102 |  107 |  112 |  117 |  122 |  127 |  132 |  137
X'    | 67.3 | 70.3 | 73.3 | 76.2 | 79.2 | 82.1 | 85.1 | 88.0 | 91.0
M_row | 67.0 | 70.3 | 74.0 | 75.9 | 78.6 | 83.2 | 85.8 | 83.7 | 97.0

* The data involved are from the two examinations correlated in Table 8.5. The means of the
columns and rows are obtained from Table 15.2.
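
The predictions in Table 15.4 come from substituting interval midpoints into the two regression equations. A minimal sketch, using the equations as printed (the function names and the list layout are assumptions made here):

```python
def predict_y(x):
    """Regression of Y on X for the two examinations."""
    return 0.630 * x + 66.05


def predict_x(y):
    """Regression of X on Y for the two examinations."""
    return 0.591 * y + 10.02


# Midpoints of the class intervals on test X (62, 67, ..., 97).
midpoints_x = range(62, 98, 5)
table_y = [(x, round(predict_y(x), 1)) for x in midpoints_x]

# Midpoints of the class intervals on test Y (97, 102, ..., 137).
midpoints_y = range(97, 138, 5)
table_x = [(y, round(predict_x(y), 1)) for y in midpoints_y]

for x, y_pred in table_y:
    print(f"X = {x}: Y' = {y_pred}")
```

Any other value of X or Y within the range of obtained scores could be substituted in the same way; the midpoints are used only to allow comparison with the column and row means.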


As a practical means of prediction, a graphic method will often be the most 
suitable procedure. If the regression lines are drawn as in Fig. 15.4 on cross- 
section paper, for any value of X on the base line, one can follow vertically up 
to the regression line and note the corresponding Y value at this point. One 
can read to the nearest unit with sufficient accuracy for practical work. The 
drawing of the regression line is simple in that two points determine the posi- 
tion of a line. One point can be at the two means, which will serve for both 
regressions. Another point for the regression of Y on X might be at X = 60,
Y = 103.85; a third point, for checking purposes, might be at X = 100 and
Y = 129.05. For the regression of X on Y, points might be located conveniently
at Y = 100, X = 69.12, and Y = 130, X = 86.85.

Standard Errors of the Estimates. We previously saw (see Table 15.3) 
that the errors of prediction (Y - Y' in the one case and X - X' in the other)
can be squared, summed, averaged, and then the square root extracted in
order to obtain the standard error of the discrepancies between observed
values and predicted values. There we computed the standard error of the
estimate from the discrepancies themselves; here we shall see that it is not 
necessary to compute the errors of prediction. 

When we have predicted on the basis of regression equations, we can
estimate the margin of error of prediction, as given by σ_yx (or by σ_xy), from the
coefficient of correlation. The formulas are

$$\sigma_{yx} = \sigma_y \sqrt{1 - r_{yx}^2} \tag{15.16a}$$
and (Standard error of estimate computed from r)
$$\sigma_{xy} = \sigma_x \sqrt{1 - r_{xy}^2} \tag{15.16b}$$


in both of which the terms are now well known. It will be seen that the two
equations are the same except for the use of σ_y when we are predicting Y and
of σ_x when we are predicting X (for r_yx = r_xy). The two standard deviations
are multiplied by the common factor √(1 − r²). This factor never exceeds
1.00 and gives us an estimate of the reduction in errors of prediction from
knowledge of correlated measurements as compared to errors of prediction without
that knowledge. When r is zero, this factor equals 1.00, and then σ_yx = σ_y
and σ_xy = σ_x. In other words, when r = 0, there is no basis for prediction.
When r = 1.0 (or −1.0), the factor reduces to zero, and so does the standard
error of estimate. This coincides with the expectation that the margin of
error of prediction is zero when the correlation is perfect.
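
The behavior of the common factor at its limits can be checked in a few lines. This is a minimal sketch, not a procedure from the text: the function name is made up for illustration, and the standard deviation of 10 is an arbitrary value chosen only to make the shrinkage easy to see.

```python
import math


def standard_error_of_estimate(sd, r):
    """Standard deviation of the predicted variable times sqrt(1 - r^2)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    return sd * math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    sd = 10.0  # arbitrary illustrative standard deviation
    for r in (0.0, 0.5, 0.9, 1.0, -1.0):
        se = standard_error_of_estimate(sd, r)
        print(f"r = {r:+.2f}  standard error of estimate = {se:.3f}")
```

With r = 0 the function returns the standard deviation unchanged, and with r = ±1 it returns zero, as the paragraph above states.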


Fig. 15.6. The line of regression of Y on X, showing the range of observed values expected
in Y in separate categories of score values on X. Parallel dashed lines above and below
the regression line, at a vertical distance of one standard error of the estimate each way,
mark off the region within which we expect two-thirds of the observed values to be.
[Horizontal axis: scores in test X, 55 to 105.]


Interpretation of an Obtained Standard Error of Estimate. The
interpretation of the standard error of the estimate when r is neither zero nor 1.00 is
somewhat as follows. Like any standard deviation, σ_yx can be referred to the
normal curve of distribution. For the examination problem,

$$\sigma_{yx} = 7.85 \sqrt{1 - .3721} = 6.22$$
and
$$\sigma_{xy} = 7.60 \sqrt{1 - .3721} = 6.02$$
No matter in what part of the measuring scale we are predicting (within the
range of obtained scores, naturally), we assume that the margin of error is the
same. When we predict Y from X, the average dispersion of observed
measurements about Y' is given by a σ of 6.22. We expect two-thirds of the
observed cases to lie within the limits of plus or minus 6.22 from Y'. This
situation is illustrated graphically in Fig. 15.6. There we have the regression
line, along which the predicted Y's lie, and in dotted lines we have the limits
of one σ_yx on either side of it. Had we plotted a point for every individual
we should have expected about two-thirds of them to fall between the two
dotted lines. To make a particular prediction, when X = 90, Y' = 122.8.
The odds are 2 to 1 that any individual whose X score is 90 will not fall below
116.6 or go above 129.0. We could state other odds for a divergence of 2σ
either way or any other distance. It all depends upon our purposes.
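
The limits and the accompanying odds can be obtained from the normal curve directly. The following sketch assumes, as the text does, that errors of prediction are normally distributed with a constant standard error; the function names are illustrative only.

```python
from statistics import NormalDist


def prediction_limits(y_pred, se_est, z=1.0):
    """Limits lying z standard errors of estimate either side of Y'."""
    return y_pred - z * se_est, y_pred + z * se_est


def proportion_within(z):
    """Proportion of a normal distribution lying within +/- z sigma."""
    unit_normal = NormalDist()
    return unit_normal.cdf(z) - unit_normal.cdf(-z)


if __name__ == "__main__":
    for z in (1.0, 2.0):
        low, high = prediction_limits(122.8, 6.22, z)
        inside = proportion_within(z)
        odds = inside / (1.0 - inside)
        print(f"{z:.0f} sigma: {low:.1f} to {high:.1f}, "
              f"proportion inside = {inside:.4f}, odds about {odds:.1f} to 1")
```

One standard error each way encloses roughly 68 per cent of cases, which is the "two-thirds" and "2 to 1" of the text in round figures.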

We could prepare a similar diagram showing the limits of the middle
two-thirds of the individuals about the regression of X on Y, and we could
interpret the errors of prediction in a similar manner. It will be noted that the
margin of error as given by σ_xy is 6.02, or 0.2 smaller in predicting in the other
direction, i.e., X from Y, but this is merely because σ_x is smaller than σ_y. The
percentage of error is the same in the two cases. The ratio of σ_yx to σ_y is exactly
the same as the ratio of σ_xy to σ_x, and that ratio is given by the factor √(1 − r²).
This factor we shall meet again with a name attached to it [see formula
(15.21)].

The Regression Line as a Mean. One way of looking at the regression line 
is to regard it as a moving average, a moving arithmetic mean. Like the 
arithmetic mean of any sample, the regression line satisfies the principle of 
least squares. The regression coefficients are so determined by the data that 
the sum of the squares of the deviations of observed points from the line is a 
minimum. Other lines might describe the trend of relationship nearly as 
well, but only the one line satisfies the principle of least squares. It is reason- 
able that, if the line is a mean, the deviations from it should be measured 
by a standard deviation. That standard deviation is the standard error of 
estimate. 
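
The least-squares property can be verified on a small example. The scores below are hypothetical, invented only for this illustration, and the function names are not from the text. The sketch fits the regression of Y on X and shows that shifting or tilting the line enlarges the root-mean-square deviation.

```python
import math

# Hypothetical paired scores, invented only to illustrate the principle.
XS = [60, 65, 70, 75, 80, 85, 90]
YS = [101, 108, 106, 115, 113, 122, 120]


def least_squares_line(xs, ys):
    """Return (intercept, slope) of the regression of Y on X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def root_mean_square_error(xs, ys, intercept, slope):
    """Standard deviation of the observed Y values about a given line."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


if __name__ == "__main__":
    a, b = least_squares_line(XS, YS)
    best = root_mean_square_error(XS, YS, a, b)
    print(f"least-squares line: Y' = {b:.3f}X + {a:.2f}, error = {best:.3f}")
    # Any other line gives a larger root-mean-square error.
    for da, db in ((1.0, 0.0), (0.0, 0.05), (-2.0, 0.02)):
        other = root_mean_square_error(XS, YS, a + da, b + db)
        print(f"line moved by ({da:+.1f}, {db:+.2f}): error = {other:.3f}")
```

The error about the fitted line is the standard error of estimate for these data, computed with N in the denominator.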

Correction of a Standard Error of Estimate for Bias. In smaller samples
(N is less than 50) it would be well to make a correction in σ_yx (or σ_xy) before
applying it to the population. The change can be made by the formula

$${}_c\sigma_{yx} = \sigma_{yx} \sqrt{\frac{N}{N - 2}} \tag{15.17}$$
(Correcting σ_yx for bias)

where N is the number in the sample. The correcting can be done as well in
the original computation, as follows:


The Reliability of a Regression Coefficient. The b coefficient in the
regression equation has its sampling error, like all statistics. This is estimated by

$$\sigma_{b_{yx}} = \frac{\sigma_{yx}}{\sigma_x \sqrt{N}} \tag{15.19}$$
or by (Standard error of a regression coefficient)
$$\sigma_{b_{yx}} = \frac{\sigma_y}{\sigma_x} \sqrt{\frac{1 - r^2}{N}} \tag{15.20}$$

The σ_b_xy would be the same, except for changing the x and y subscripts around.
For our examination problem

$$\sigma_{b_{yx}} = \frac{6.22}{(7.60)(9.3274)} = .088$$

We may say that the odds are 2 to 1 that the obtained b_yx of .63 does not
deviate from the population b by more than .088. There is very little
chance that the true b coefficient here is zero.

¹ For an excellent critical discussion of regression effects in research problems see
Thorndike, R. L. Regression fallacies in the matched group experiment. Psychometrika, 1942,
7, 85-102.
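
As a check on the arithmetic, formulas (15.17) and (15.20) can be evaluated together. The sketch below uses the examination values quoted in this chapter (standard deviations 7.85 and 7.60, r = .61). The sample size N = 87 is an assumption made here, chosen because √87 ≈ 9.3274 reproduces the worked value of .088; the function names are illustrative.

```python
import math


def corrected_se_of_estimate(se_est, n):
    """Formula (15.17): enlarge the standard error of estimate by sqrt(N/(N-2))."""
    return se_est * math.sqrt(n / (n - 2))


def se_of_regression_coefficient(sd_y, sd_x, r, n):
    """Formula (15.20): standard error of b for predicting Y from X."""
    return (sd_y / sd_x) * math.sqrt((1.0 - r * r) / n)


if __name__ == "__main__":
    n = 87  # assumed sample size (see the note above)
    se_b = se_of_regression_coefficient(sd_y=7.85, sd_x=7.60, r=0.61, n=n)
    print(f"standard error of b_yx      = {se_b:.3f}")
    print(f"b_yx in standard-error units = {0.63 / se_b:.1f}")
    print(f"corrected standard error of estimate = "
          f"{corrected_se_of_estimate(6.22, n):.2f}")
```

A coefficient of .63 lies several standard errors from zero, which is the basis for the statement that a true b of zero is very unlikely. With a sample of this size the bias correction changes the standard error of estimate only slightly.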


THE CORRELATION COEFFICIENT AND ACCURACY OF PREDICTION 


The chief index of goodness of prediction of measurements thus far in this 
discussion has been the standard error of estimate. It has been shown how 
the latter is closely related to the coefficient of correlation. As r increases, 
the standard error of estimate decreases. There are other ways in which r 
and some of its derivatives can be used to indicate accuracy of prediction. 
Three of the common derivatives are the coefficient of alienation, the index of 
forecasting efficiency, and the coefficient of determination. Each has its unique 
story to tell about the closeness of correlation between two things and about 
the utility of predictions. 

The Coefficient of Alienation. Whereas r indicates the strength of
relationship, the coefficient of alienation, k, indicates the degree of lack of relationship.
By formula,

$$k = \sqrt{1 - r^2} \tag{15.21}$$
(Coefficient of alienation computed from r)

Squaring both sides of this equation, we have

$$k^2 = 1 - r^2$$

And transposing, we have

$$k^2 + r^2 = 1.00$$

Thus, although we might have expected k plus r to equal 1.00, it is rather the
sum of their squares that equals 1.00. If r is .50, k is not also .50 but .866.
When r is .50, then, the degree of relationship is less than the degree of lack of
relationship. It is when r = .7071 that relationship and lack of relationship
are equal, for k also then equals .7071. Then r² + k² = .50 + .50 = 1.00.
Other values of k for different sizes of r can be found in Table 15.5. Figure
15.7 shows pictorially the functional relationship between k and r. Students
of mathematics will recognize the relationship r² + k² = 1.00 as the equation
for a circle with a radius of 1.00. The diagram shows only positive values
of r and k.
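
A brief numerical check of formula (15.21) may help fix the idea that it is the squares, not the coefficients themselves, that sum to 1.00. The function name below is illustrative only.

```python
import math


def coefficient_of_alienation(r):
    """k = sqrt(1 - r^2), formula (15.21)."""
    return math.sqrt(1.0 - r * r)


if __name__ == "__main__":
    for r in (0.50, math.sqrt(0.5), 0.90):
        k = coefficient_of_alienation(r)
        print(f"r = {r:.4f}  k = {k:.4f}  r + k = {r + k:.4f}  "
              f"r^2 + k^2 = {r * r + k * k:.4f}")
```

Only at r = .7071 do r and k coincide; below that value k is the larger of the two.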


TABLE 15.5. INDICATORS OF THE IMPORTANCE OF COEFFICIENTS OF CORRELATION

          k_xy             E = 100(1 − k_xy)          100r²_xy
r_xy      Coefficient of   Percentage reduction in    Percentage of variance in
          alienation       errors of prediction of    Y associated with
                           Y from X                   variance in X

.00       1.000             0.0                        0.00
.05        .999              .1                        0.25
.10        .995              .5                        1.00
.15        .989             1.1                        2.25
.20        .980             2.0                        4.00
.25        .968             3.2                        6.25
.30        .954             4.6                        9.00
.35        .937             6.3                       12.25
.40        .917             8.3                       16.00
.45        .893            10.7                       20.25
.50        .866            13.4                       25.00
.55        .835            16.5                       30.25
.60        .800            20.0                       36.00
.65        .760            24.0                       42.25
.70        .714            28.6                       49.00
.75        .661            33.9                       56.25
.80        .600            40.0                       64.00
.85        .527            47.3                       72.25
.90        .436            56.4                       81.00
.95        .312            68.8                       90.25
.98        .199            80.1                       96.04
.99        .141            85.9                       98.01
.995       .100            90.0                       99.00
.999       .045            95.5                       99.80

Sometimes we wish to stress the point of independence between two things
rather than their closeness of agreement. In such instances, we present k as
well as r. Besides being related to r, k is also related to other indices of
goodness of prediction to be mentioned next.

The relation of k to r is the same as that of the sine of an angle to the cosine of that
angle. Values of k corresponding to known values of r can be found by using Table J
in Appendix B.


CH. 15] PREDICTION OF MEASUREMENTS 377


The Index of Forecasting Efficiency. In the formula for the SE of the
estimate, σ_yx = σ_y √(1 − r²_yx), we can now see that the radical factor,
√(1 − r²_yx), is really the coefficient of alienation. We could rewrite the
formula as σ_yx = kσ_y. If we were to multiply k by 100, we should have the
percentage that σ_yx is of σ_y. When r = .61, as in our recent illustration, k = .7924.
The SE of the estimate in this problem is 79.24 per cent of the observed
dispersion of observations. Our margin of error in predicting Y with knowledge
of X scores is about 79 per cent as great as the margin of error we should make
without knowledge of X scores. For then we predict every Y to be the mean
of the Y's, and the SE of the prediction then equals σ_y. The reduction of our
margin of error is 100 minus 79.24, or 20.76 per cent. The index of forecasting
efficiency is defined as the percentage reduction in errors of prediction by
reason of correlation between two variables. The general, simplified formula is

$$E = 100\left(1 - \sqrt{1 - r^2}\right) \tag{15.22}$$
(Index of forecasting efficiency)
or
$$E = 100(1 - k)$$

Fig. 15.7. Chart showing k (coefficient of alienation) and d (coefficient of determination) as
functions of r (coefficient of correlation). [Horizontal axis: coefficient of correlation (r), 0 to 1.0.]


The calculation of E is facilitated by Table 15.5, where many of the E 
values are given for corresponding r's. Inspection will show that r must be 
as high as about .45 before E is 10 per cent. When a test has a validity 
coefficient of .45, the size of errors of prediction, on the whole, is only 10 per 
cent less than that we should have without knowledge of test scores but with 
knowledge of the mean criterion measure. Taken at its face value, this does 
not seem much of a gain. There are situations, however, in which, as will be 
shown later, a gain of even less might be of practical importance. 
Better tests, with validity coefficients of .60, have an E of 20 per cent, and
still better tests, when r = .75, have an E of about 34 per cent. Although
these efficiencies may also seem small, we must treat them in a relative, not 
an absolute, sense. It is probable that the efficiency of predictions based 
upon the average unsystematic interview is less than 5 per cent. With this 
as our base, the picture of efficiency of tests looks much better. 
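
Formula (15.22) can also be turned around to ask how high a validity coefficient must be to reach a stated efficiency. The sketch below is illustrative only; the inverse follows algebraically from E = 100(1 − k) and k² + r² = 1, and the function names are not from the text.

```python
import math


def forecasting_efficiency(r):
    """E = 100 (1 - sqrt(1 - r^2)), formula (15.22)."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))


def r_needed_for_efficiency(e):
    """Validity coefficient required for a given E (0 <= E <= 100)."""
    k = 1.0 - e / 100.0
    return math.sqrt(1.0 - k * k)


if __name__ == "__main__":
    print(f"E when r = .61: {forecasting_efficiency(0.61):.2f} per cent")
    for e in (5, 10, 20, 34, 50):
        print(f"E = {e:2d} per cent requires r of about "
              f"{r_needed_for_efficiency(e):.3f}")
```

The steep rise in the required r shows why efficiencies of 20 to 34 per cent already correspond to tests of good validity.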

Figure 15.8 shows graphically the functional relationship between E and r.
The range of r's from .3 to .8 is marked off as representing the level of validity
coefficients usually found for useful predictive instruments in psychological
and educational practices. Tests rarely show correlations greater than .8
with practical criteria, and those correlating less than .3 are usually of limited
value when used alone. In a battery to which they make a unique
contribution it may still be worth while to use them. The corresponding limits on
the scale of E are 4.6 and 40.

Fig. 15.8. E (index of forecasting efficiency) as a function of r.

The Coefficient of Determination. Another mode of interpretation of r is
in terms of r², which is called the coefficient of determination. This statistic is
also sometimes symbolized as d. The coefficient r² gives us (when multiplied
by 100) the percentage of the variance (see Chap. 5) in Y that is associated
with or determined by variance in X. When r = .50, the percentage of the
variance in Y that is accounted for by variance in X is 25, or one-fourth. To
account for half the variance of any set of measurements, the r with another
variable would have to be .7071. The proportion of the variance in Y not
determined by or associated with variance in X is given by k², which is called
the coefficient of nondetermination. These statements about determination of
Y by X are reversible and apply equally well to determination of X by Y.
We should speak of determination of one thing by another, however, only when
a causal relationship can be logically defended; otherwise the expression
associated with or accounted for (by way of prediction) is better. In Table
15.5, several of the 100r² values are given for corresponding r's. In Fig. 15.7
is presented graphically the functional relationship between d and r.

Predicted and Nonpredicted Variances. The coefficient of determination,
as well as its relations to r, k, and other statistics, can best be clarified by
introducing another new idea. The total amount of variance in the predicted
variable, Y, we denote by σ²_y. We can think of this variance as being broken
down into two independent components, the predicted and the nonpredicted
portions. The predictions of Y, which we have called Y', have their
dispersion and their variance, which are denoted by σ_y' and σ²_y', respectively. The
standard deviation σ_y' would be computed from the deviations of the
predicted values about the mean of the Y values, M_y. The amount of
nonpredicted variance is indicated by the square of the standard error of estimate
(σ²_yx). This statistic is computed from the deviations of the obtained Y
values from the regression line (or from the predicted Y values). The two
component variances of σ²_y are therefore

$$\sigma_y^2 = \sigma_{y'}^2 + \sigma_{yx}^2 \tag{15.23}$$
(Component variances in the predicted variable)

If we divide this equation through by σ²_y we have everything in terms of
proportions.

$$\frac{\sigma_y^2}{\sigma_y^2} = \frac{\sigma_{y'}^2}{\sigma_y^2} + \frac{\sigma_{yx}^2}{\sigma_y^2} = 1.0 \tag{15.24}$$
(Total variance as the sum of two proportions)

The first term on the right, σ²_y'/σ²_y, is the proportion of the variance in Y that
is predicted and the second term is the proportion of the variance that is not
predicted. We have already defined r² as the proportion of predicted
variance and k² as the proportion of nonpredicted variance. This means that
r² equals σ²_y'/σ²_y and that k² equals σ²_yx/σ²_y, and that r = σ_y'/σ_y and k = σ_yx/σ_y.
We therefore have some new concepts of r and k. We can say that r is the
ratio of the dispersion of predicted values to the dispersion of obtained values
and that k is the ratio of the dispersion of errors to the dispersion of obtained
values.
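
The partition in formulas (15.23) and (15.24) can be confirmed on any set of paired scores. The data below are hypothetical, made up for this illustration, and the function names are not from the text. Variances are computed with N in the denominator.

```python
import math

# Hypothetical paired scores used only to illustrate the partition.
XS = [52, 58, 61, 67, 70, 74, 79, 85]
YS = [98, 111, 104, 118, 109, 125, 121, 130]


def variance(values):
    """Variance about the mean, with N in the denominator."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def partition_variance(xs, ys):
    """Split the variance of Y into predicted and nonpredicted parts."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    slope = cov / variance(xs)
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    errors = [y - p for y, p in zip(ys, predicted)]
    return {
        "r": cov / math.sqrt(variance(xs) * variance(ys)),
        "total": variance(ys),
        "predicted": variance(predicted),
        "nonpredicted": sum(e * e for e in errors) / n,
    }


if __name__ == "__main__":
    parts = partition_variance(XS, YS)
    print(f"total variance          = {parts['total']:.3f}")
    print(f"predicted + nonpredicted = "
          f"{parts['predicted'] + parts['nonpredicted']:.3f}")
    print(f"r^2                      = {parts['r'] ** 2:.4f}")
    print(f"predicted / total        = {parts['predicted'] / parts['total']:.4f}")
```

The last two printed values agree, which is the statement that r² is the proportion of predicted variance.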
EFFECTIVENESS OF SELECTION TESTS 

Although the coefficient of correlation and its derivatives, k, E, r², and
k², are all accurate and meaningful ways of interpreting the goodness of
predictions, and they serve well for those who know how to use them, in some
practical situations they leave something to be desired. To quote them to 
the layman may earn the investigator a cool reception and an empty stare. 
Even the statistically informed test expert may find it desirable at times to 
cast his conclusions in other terms. This is true, particularly, when we are 
dealing with selection tests. 
Those concerned with the administrative problems of selecting personnel
by means of tests find that a different kind of enlightenment is desirable than 
that provided by the statistics in question. It is one thing to know that by 
the use of this test score, or this composite score, errors of prediction are 
reduced 15 per cent. But what does this mean with regard to the number of 
applicants one must examine, and what proportion one must accept for train- 
ing in order to have a certain number of successful employees at the end of 
training? With the same number of applicants selected, how many more 
satisfactory ones shall we have with the aid of the selection test than we 
should have had without it? Even if we could get the employer to grasp the
idea of the index of forecasting efficiency as an abstract indicator of amount
of gain achieved by the test, to most laymen the E values actually reached
by most test procedures sound very unimpressive, because laymen generally 
lack the proper experience to evaluate them. For these reasons, several 
suggestions have been made in recent years for more realistic and fruitful 
ways of evaluating selection tests. One of these will be described in some 
detail and the others mentioned in principle. 

Determiners of Effective Selection. Everything else being equal, validity 
coefficients (and statistics derived from them) are accurate indices of the 
effectiveness of selection tests. It has been pointed out, however, that the 
correlation of a test with a practical criterion is not the only thing to be con- 
sidered when practical decisions must be made. The practical utility of tests 
in any training or job situation depends upon other factors than the validity 
of the test or test battery. It depends upon the percentage of employees who 
would have succeeded if testing had not been applied in selection. It also 
depends upon the percentage of the applicants who are selected by means of 
the tests.

The Taylor-Russell Method. Taylor and Russell have rationalized the 
problem in a clear manner. Following their exposition of the matter, the 
selection situation with tests is described in Fig. 15.9. The X axis represents 
the scale of test scores and the vertical axis represents the scale of the training 
or job criterion. Let us assume that the correlation between test and 
criterion is about .50. The ellipse describes the dispersion of individuals in 
this two-dimensional surface. On the X scale a point X_c is marked. This is
an arbitrary critical or qualifying score on the test. Individuals with scores
above X_c are selected and those with scores below X_c are rejected.

Without selection on the basis of the test, a certain percentage of the
accepted applicants would have succeeded. We assume a continuous
variable for the criterion as well as for the test. The point Y_c is an arbitrary
critical criterion value above which the verdict is success and below which
the verdict is failure. By drawing lines at X_c and Y_c parallel to axes Y and
X, respectively, we divide the population into four kinds of individuals
defined as follows:¹

A. Individuals who if selected would succeed

B. Individuals who would be rejected but who if allowed to compete would
succeed

C. Individuals who if selected would fail

D. Individuals who would be rejected and who if allowed to compete would
fail

Taylor, H. C., and Russell, J. T. The relationship of validity coefficients to the
practical effectiveness of tests in selection. J. appl. Psychol., 1939, 23, 565-578.


Fig. 15.9. Correlation surface divided by a critical score (X_c), which separates the
population into selected and rejected groups of individuals on the basis of test results, and by a
critical criterion value (Y_c), which separates the same population into successful and
unsuccessful individuals in a job assignment. [Horizontal axis: test score; vertical axis:
job-proficiency criterion.]


Success Ratios and Selection Ratio. It is clear that the A and D people are 
correctly predicted under these conditions and the B and C people are incor- 
rectly predicted. We have thus reduced the prediction problem to one of 
prediction of (quantitative) attributes from (quantitative) attributes. The 
evaluation of predictions in this form could be carried out much as was 
described earlier. Here, however, the problem is much more complicated,
because we have to consider different division points on the success scale as
well as different critical scores for selection on the test scale. In
attribute-prediction problems the division points are usually fixed by the nature of
things.


¹ The letter symbols (A, B, C, D) are defined somewhat differently than by Taylor
and Russell. Here they have been made more consistent with the corresponding
categories (a, b, c, d) in the usual 2 × 2 contingency table.
We are now ready to consider two new concepts proposed by Taylor and
Russell. One is the success ratio and the other the selection ratio. The
success ratio is the proportion of accepted candidates who would be successful.
There would be a certain success ratio without the use of selection tests, and
another success ratio with the use of tests, provided the tests have any
validity at all, and provided some selection occurs. The selection ratio is the
proportion of all applicants examined who are accepted. In terms of symbols
and equations, the success ratio without the use of tests is

$$S_0 = \frac{A + B}{A + B + C + D} = \frac{A + B}{N} \tag{15.25}$$
(Success ratio without the use of selection tests)

where letters A, B, C, and D are defined as in Fig. 15.9. When there has been
selection on the basis of a valid test,

$$S_1 = \frac{A}{A + C} \tag{15.26}$$
(Success ratio with the use of tests)

The selection ratio is

$$p_s = \frac{A + C}{A + B + C + D} \tag{15.27}$$
(Selection ratio)
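
Formulas (15.25) to (15.27) are simple ratios of the four counts. In the sketch below the counts are hypothetical, chosen only to make the arithmetic easy to follow, and the function and key names are illustrative.

```python
def selection_summary(a, b, c, d):
    """Success ratios and selection ratio from the four groups of Fig. 15.9.

    a: selected, would succeed     b: rejected, would have succeeded
    c: selected, would fail        d: rejected, would have failed
    """
    n = a + b + c + d
    return {
        "success_ratio_without_tests": (a + b) / n,  # formula (15.25)
        "success_ratio_with_tests": a / (a + c),     # formula (15.26)
        "selection_ratio": (a + c) / n,              # formula (15.27)
    }


if __name__ == "__main__":
    # Hypothetical counts for 100 applicants.
    summary = selection_summary(a=40, b=20, c=10, d=30)
    for name, value in summary.items():
        print(f"{name}: {value:.2f}")
```

With these counts, selecting half the applicants raises the proportion of successes among those accepted from .60 to .80.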


Favorable Success Ratios (before Selection). A few examples will illustrate 
the fact that effectiveness of selection by tests depends upon the success ratio 
that would prevail without that selection. It is obvious that if all trainees or 
employees would be satisfactory without the use of selection tests, there 
would be little excuse for using them. The chances of improving matters by 
this approach would be nil, except as the quality or average production of 
satisfactory personnel were raised as a result. When the success ratio with- 
out tests is very low, there is much room for improvement, and with valid 
tests some improvement is bound to occur. 

Consider Fig. 15.10 in this connection. There four special situations are 
shown: cases of high and low test validity combined with high and low success 
ratios. In diagram I, the success ratio is high. One could move the critical 
score over a considerable range without changing the success ratio very much, 
until the selection ratio became very small. In diagram II, even where the 
correlation is high, a change in the cutoff score would disqualify very few 
potential failures, and eliminating even a few would result in losing many
more potentially successful candidates. In diagrams III and IV, the success 
ratios are very small. In diagram III, even a small number of rejections 
would disqualify many potential failures with little or no loss of potential 
successes. This is even more true where the validity of the test is much 
higher, as shown in diagram IV. In general, then, we stand to gain most when 
success ratios without testing are small. 
Favorable Selection Ratios. If the number of applicants relative to the
number of places to fill is small there is, of course, not much opportunity for
selection. In the limiting case, if no one could be rejected there would be no use
of selection tests. On the other hand, if there are many more applicants than
places, and if one can then skim the “cream” off the top of the applying
group, the chances of improving the quality of accepted personnel would seem
to be great. This presupposes a method whereby the “cream” can be
properly recognized. A valid test does that. But how valid must a test be before
there is sufficient recognition of top talent?

Fig. 15.10. Diagram similar to Fig. 15.9, with different combinations of selection ratio and
success ratio (for definition of these ratios, see the text) and different degrees of validity.

Figure 15.10, diagram III, shows that even a test of rather low validity may
be effective in skimming the “cream” in a negative way. That is, it can do
much to eliminate failures. One could move the critical score a considerable
distance and still reject several times as many potential failures as he would
lose among potential successes. From diagram I, however, we get the
suggestion of a warning that if the qualifying score is set too high in a test of low
validity we may be losing some of the very best qualified. We cannot press
refined decisions of this kind too far on the basis of these diagrams because
the populations are not uniformly distributed throughout the elliptical areas;
they thin out around the margins. The general tendencies, however, should
be clear.

It is clear from what has been said above that a test of low validity may be 
very useful in selection under a favorable set of conditions. Those conditions 
include certain combinations of success ratios and selection ratios. It can 
also be seen that even a test of high validity may be of little or no value if the 
conditions are unfavorable. Consider diagram II, in which the success ratio 
is very high. One could not eliminate many potential failures without losing 
many more satisfactory personnel. The higher the critical score, however, 
the more satisfactory the successful personnel would tend to be. It depends 
upon whether we are interested in numbers of successful individuals or in 
average quality. There are administrative questions of balance, also. It 
might be disadvantageous to take on at one time a whole class of prima 
donnas! 

Some numerical examples may be given to illustrate the points just made 
concerning favorable success and selection ratios. Let us assume a validity 
coefficient of .60, a typical value for good selection batteries. Let us also 
assume normal distributions in both test and criterion. If the success ratio
S_0 is .95, by rejecting 40 per cent of the applicants we could achieve a success
ratio S_1 of .99. This is an improvement of only about 4 per cent over the
results without the tests. Compare this with the index of forecasting
efficiency, which is 20 per cent when r = .60. To bring the S_1 up to 1.00,
approximately, we would need to reject at least 60 per cent of the applicants.
In either case, we reject about 10 applicants to gain one more successful indi- 
vidual. Rejections beyond 60 per cent would gain us practically nothing. 

Let the success ratio S_0 be .05, and what is the result? A rejection of 55
per cent of the applicants would net an increase of .05 in the success ratio, a
gain of 100 per cent. By rejecting as many as 95 per cent the S_1 could be
raised to .30. This is a gain of 500 per cent. Compare these percentage
gains with the index of forecasting efficiency of 20.

To take less extreme instances of S_0, let us assume ratios of .80 and .20,
with r still equal to .60. With the high S_0 of .80, we need to reject about 60
per cent in order to raise S_1 to .95, a gain of about 19 per cent. With the low
S_0 of .20, the rejection of 60 per cent yields a success ratio of .38, a gain of
90 per cent.
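
Success ratios of this kind can be approximated by integrating a bivariate normal surface, which is the model the examples assume. The sketch below is an independent illustration, not Taylor and Russell's tabled procedure: it cuts the test at the score that admits the stated proportion, cuts the criterion so that the stated proportion would succeed without testing, and integrates numerically by Simpson's rule. All function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist

UNIT_NORMAL = NormalDist()


def success_ratio_with_test(r, s0, selection_ratio, steps=4000):
    """Proportion of selected applicants who succeed (bivariate normal model).

    r: validity coefficient (-1 < r < 1)
    s0: success ratio without testing (0 < s0 < 1)
    selection_ratio: proportion of applicants accepted (0 < ratio < 1)
    steps: number of Simpson intervals (must be even)
    """
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    x_c = UNIT_NORMAL.inv_cdf(1.0 - selection_ratio)  # critical test score
    y_c = UNIT_NORMAL.inv_cdf(1.0 - s0)               # critical criterion value
    spread = math.sqrt(1.0 - r * r)                   # SD of Y for a fixed X

    def density_of_selected_successes(x):
        # Density of test score x times the chance of success at that score.
        return UNIT_NORMAL.pdf(x) * (1.0 - UNIT_NORMAL.cdf((y_c - r * x) / spread))

    upper = max(x_c, 0.0) + 8.0  # the normal tail beyond this is negligible
    h = (upper - x_c) / steps
    total = density_of_selected_successes(x_c) + density_of_selected_successes(upper)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * density_of_selected_successes(x_c + i * h)
    return (total * h / 3.0) / selection_ratio


if __name__ == "__main__":
    for s0, ratio in ((0.80, 0.40), (0.20, 0.40), (0.05, 0.05)):
        s1 = success_ratio_with_test(0.60, s0, ratio)
        print(f"S0 = {s0:.2f}, selection ratio = {ratio:.2f} -> S1 = {s1:.2f}")
```

When r is set to zero the function returns the success ratio without testing, as it should, since a test of no validity cannot change the proportion of successes.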

A Graphic Chart of Relations of S_1 to Selection Ratio. Figure 15.11 shows,
for the situation when the validity coefficient is .60, the change in success ratio
S_1 as the selection ratio changes. Each curve represents a different initial or
basic success ratio, S_0. Taylor and Russell provide tables which record these
same relationships for various validity coefficients, and Guilford and Michael 
provide charts similar to Fig. 15.11 for other validity levels. 

Taylor and Russell, op. cit.; Guilford, J. P., and Michael, W. B. Prediction of Cate- 
gories from Measurements. Beverly Hills, Calif.: Sheridan Supply Co., 1949. 
Fig. 15.11. Chart relating success ratio to selection ratio when the validity coefficient is .60.
[Vertical axis: proportion of those selected who will be successful; horizontal axis: selection
ratio or the proportion selected (p_s), 0 to 1.00.]

Indices-of-improvement Methods. In the Taylor-Russell method of test
evaluation our attention is concentrated upon numbers and percentages of
successful individuals. We ask what is the percentage increase in the
numbers of satisfactory personnel, without specifying anything about the degree
of satisfaction. Much depends upon the placing of a passing point on the
criterion scale and an ignoring of the fact that success is a graded variable.
In terms of planning in selection and training programs, particularly in
military situations, where numbers of recruits may be liberal and standards of
passable satisfaction are readily established, this kind of evaluation of a
selection instrument or program is adequate and well adapted. There are
other procedures, however, that concentrate more upon the fact of graded
excellence in criterion measures, and which involve thinking in terms of work
output of personnel. The worth of a selection program is established if we
can demonstrate a certain percentage increase in production of some kind.
If the criterion is measured in terms of absolute amounts of production of
workers, we may ask, “What percentage improvement in production does
test selection bring about?” The answer can then be balanced against the
cost of the testing program.

The Jarrett Method. Although the first suggestion for this kind of index of
test evaluation was made by Richardson, a more useful procedure was
developed by Jarrett.¹ With somewhat different symbols than those used by
Jarrett, his index of improvement can be computed by the formula

$$I = 100\, r_{yx}\, v_y \left(\frac{M_s - M_x}{\sigma_x}\right) \tag{15.28}$$
(Percentage improvement in output for a selected group)

where r_yx = validity coefficient for the test and v_y = index of variability of
criterion scores given by the equation²

$$v_y = \frac{\sigma_y}{M_y} \tag{15.29}$$
(Relative variability of measurements)

where M_s = mean of test scores for the selected personnel
      M_x = mean of test scores for all applicants
      M_y = mean of criterion measurements
      σ_y = standard deviation of the criterion measures

If we may assume that the criterion measures are normally distributed, the
last term in formula (15.28) is equivalent to the ratio y/p, and we have

$$I = 100\, r_{yx}\, v_y\, \frac{y}{p} \tag{15.30}$$
(Percentage improvement in output for a selected group in a normally distributed criterion)

where p = proportion of applicants selected and y = ordinate in a unit
normal distribution curve at a point marking off p proportion of cases.

Richardson, M. W. The interpretation of a test validity coefficient in terms of
increased efficiency of a selected group of personnel. Psychometrika, 1944, 9, 245-248.

An inspection of formula (15.30) leads to some interesting inferences which 
agree with things already pointed out. With v and /p constant, I is entirely 
dependent upon the validity of the test and directly proportionaltoit. With 
Tye constant, J increases as v, increases. That is, the more variable the 
criterion measures with respect to their mean, the greater is the improvement 
resulting from selection. It is reasonable that if all workers performed 
equally well there would be little use to attempt to discriminate among them 
by means of tests. The better they can be discriminated in terms of indi- 
vidual output, the better the chance there is of differentiating among them 
by means of predictive instruments. The factor y/p, as will be seen in 
Table G, is larger as p approaches. OO and smaller as р approaches 1.0. When 
$ = 01 this ratio is about 100 times as large as when p = .99. This principle 
agrees with the one applying to the Taylor-Russell method: that the lower 
the selection ratio, the greater the benefit from selection.? 
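As a rough numerical check on formula (15.30), the computation can be sketched in Python. The function names and the values of $r_{xy}$, $v_y$, and p used below are hypothetical and chosen only for illustration; the ordinate y is taken from the standard library's unit normal distribution rather than from Table G.

```python
from statistics import NormalDist


def ordinate_ratio(p):
    """Return y/p, where y is the unit-normal ordinate cutting off the top p."""
    unit_normal = NormalDist()             # mean 0, standard deviation 1
    z_cut = unit_normal.inv_cdf(1.0 - p)   # score above which proportion p falls
    return unit_normal.pdf(z_cut) / p


def percentage_improvement(r_xy, v_y, p):
    """Formula (15.30): I = 100 * r_xy * v_y * (y / p)."""
    return 100.0 * r_xy * v_y * ordinate_ratio(p)


# Hypothetical values, not taken from the text.
improvement = percentage_improvement(r_xy=0.50, v_y=0.20, p=0.50)

# The text's remark that y/p at p = .01 is about 100 times that at p = .99.
ratio_of_ratios = ordinate_ratio(0.01) / ordinate_ratio(0.99)
print(round(improvement, 2), round(ratio_of_ratios, 1))
```

With these assumed values the index comes to roughly 8 per cent, and the ratio of the two y/p factors comes to 99, in line with the "about 100 times" stated above.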


1 Jarrett, R. F. Per cent increase in output of selected personnel as an index of test
efficiency. J. appl. Psychol., 1948, 32, 135-145.

2 The statistic $v_y$ will be recognized as one-hundredth of the coefficient of variation given
in Chap. 5. Here, as well as there, the measurements must be in terms of a scale with an
absolute zero point. Piecework scores, dollar values of output, and the like qualify for
the use of this statistic. Ratings would not qualify.

3 For a table and chart based upon Jarrett's method, see Brown, C. W., and Ghiselli,
E. E. Per cent increase in proficiency resulting from use of selective devices. J. appl.
Psychol., 1953, 37, 341-344.




Evaluation in Terms of Cost and Utility. Berkson has recently developed
a procedure which emphasizes a comparison of the utility of a test with its
cost. Utility is defined as the percentage of potential failures that would be
eliminated by the test. Cost is the percentage of potential graduates the
test would eliminate. These definitions can be referred to Fig. 15.9. Utility
would equal 100D/(C + D). Cost would equal 100B/(A + B). The
indices are, of course, related to the positions of the cutoff score and to the
success ratio. In comparing tests, Berkson uses a single index number based
upon the average cost for all utilities. For details the reader is referred to
Berkson's description.¹
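The two definitions can be illustrated with a short Python sketch. The reading of the four cells is an assumption inferred from the definitions above (A and B as potential graduates accepted and rejected, C and D as potential failures accepted and rejected), and the counts are hypothetical.

```python
def utility_and_cost(a, b, c, d):
    """Return (utility, cost) as percentages for one cutoff score.

    Assumed cell meanings, inferred from the definitions in the text:
      a = potential graduates accepted, b = potential graduates rejected,
      c = potential failures accepted,  d = potential failures rejected.
    """
    utility = 100 * d / (c + d)   # percentage of potential failures eliminated
    cost = 100 * b / (a + b)      # percentage of potential graduates eliminated
    return utility, cost


# Hypothetical counts, not taken from Fig. 15.9.
utility, cost = utility_and_cost(a=60, b=15, c=10, d=15)
print(utility, cost)
```

Raising the cutoff score would move cases from A to B and from C to D, so that utility and cost rise together; this is why the two indices have to be weighed against each other.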


[Figure 15.12. Vertical axis: proficiency in a job assignment. Horizontal axis: score on an
intelligence test, with scale points at 75, 100, 125, and 150 and the cutoff points marked.]

FIG. 15.12. A curved regression of a job-proficiency criterion variable on the test-score
variable X, showing that a high cutoff score may be needed in addition to a low one.


Selection When Regressions Are Nonlinear. Previous discussions of 
selection by means of tests have assumed linear regression; the assumption is 
that, throughout the range, the higher the score, the greater the average 
criterion performance of the individual. We should not leave the subject of 
selection without considering the case of curved regressions. Figure 15.12 
shows in general form a type of regression that may be more common than 
has been realized. 

1 Berkson, J. Cost-utility as a measure of the efficiency of a test. J. Amer. statist. Ass.,
1947, 42, 246-255.

There has been a common conclusion in the industrial-psychology literature
that individuals of high intelligence are likely to do less well at highly
routinized, repetitive tasks than individuals of lower intelligence. The effect
may be due to lack of interest and to boredom on the part of the highly
intelligent person, but for predictive purposes we do not particularly need to know
the reasons. The fact of curved regression is undeniable and should be 
recognized in selection. It is likely that when the whole range of intelligence 
is studied in relation to job proficiency of many kinds, there will be found an 
optimal intelligence level for each kind of job. Curved regressions are often 
overlooked because the investigator fails to plot scatter diagrams, or because 
he has a restricted range in his population. In application for jobs, there is 
often enough self-selection beforehand that a limited range appears for 
examination. The resulting regression is therefore often linear within that 
range, and some correlations are zero because in that range there is no upward 
trend in Y as X increases. In relating certain temperament-test scores to 
rated proficiency of administrators, for example, the writer has found a few 
undeniable signs of curvature, with the optimal score not at the top. Relations
of other temperament scores to job-proficiency measures in such routine
tasks as cigar wrapping and stocking pairing reveal optimal scores below the
average, that is, toward the extreme ordinarily denoted as poor personality
traits.¹

Wherever curvature such as that shown in Fig. 15.12 is indicated by the 
data, two critical scores may be called for. If a cutoff score were placed at 
Xa then all the personnel above that point are apparently about equally 
good in terms of job proficiency. If the cutoff point were moved up to Хе, 
however, there are individuals having scores at the upper end who are just as 
poor performers on the job as many below Ху. A second critical point at 
X» would eliminate the high-scoring but below-optimal performers. If 
selection were further restricted, it should be restricted from both directions. 

The problems of evaluating selection devices when regressions are non- 
linear are more complex than those we have already seen. None has been 
worked out for this kind of situation, but variations of methods already
described would serve. The fundamental principles would be the same. 


Exercises 


1. Using the data of Table 14.10, predict the most probable score in the personality 
inventory for alcoholics and nonalcoholics, and for the two combined. What is the margin 
of error of prediction as made in these three ways? 

2. Compute a standard error of estimate for the prediction problem in Exercise 1. 
What does it tell us? 

3. What is the most probable total score for the passing and failing students repre- 
sented in Table 13.4? What is the accuracy of prediction for each category? How much 
improvement from knowledge of category? 

4. For Data 15A, find the best prediction of score in the Opposites test corresponding
to each midpoint score in the Mixed Sentences test. Estimate the margin of error
for each prediction and for the predictions taken as a whole.


1 See the author's discussion of problems of validation of measures of interests and tem- 
perament, in Thurstone, L. L. (ed.). Applications of Psychology. New York: Harper, 
1952. 
DATA 15A. A SCATTER DIAGRAM FOR TWO MENTAL TESTS

Rows: Y (Opposites test in Army Alpha). Columns: X (Mixed-sentences test in Army Alpha).
36-38 1 
33-35 3 
30-32 14 
27-29 11 
24-26 17 
21-23 15 
18-20 22 
15-17 2 12 
12-14 1 8 
9-11 3 9 
6-8 
fe 


5. Find the two regression equations for Data 15A. Make all possible checks as to
internal consistency of your computations.

6. Using the appropriate regression equation, make a prediction of score in the Opposites
test corresponding to each midpoint score in the Mixed Sentences test. Compare these
predictions with those obtained in Exercise 4.

7. Compute the two standard errors of estimate for Data 15A. What are the amounts
of predicted and of nonpredicted variance in Y? What are the proportions of these two
kinds of variances here?

8. Draw a diagram like Fig. 15.6 that applies to Data 15A. Draw another diagram
like Fig. 15.4 showing the two regression lines.

9. Derive the statistics k, E, and $r^2$ for Data 15A. Interpret these findings.

10. Using formula (15.15), compute a regression equation for the first 10 pairs of scores
for parts V and VI in Data 8A.

Answers 


1. Most probable score: 14.1; 32.8; 25.3.
   Margin of error ($\sigma$): 10.4; 13.9; 15.6.
2. $\sigma_{yx}$ = 12.6.
3. Means: 98.3, 83.6; SD's: 16.3, 16.2; $\sigma_{yx}$ = 16.2; improvements: 7.9 per cent, 8.3
   per cent.
4. $M_y$: 12.5; 14.2; 17.1; 18.2; 19.5; 23.2; 25.7; 28.2.
   $\sigma_y$: 2.1; 3.1; 5.0; 7.1; 4.7; 5.7; 4.9; 4.6.
   $\sigma_{yx}$ = 5.07.
5. $M_x$ = 14.19, $M_y$ = 21.65; $\sigma_x$ = 5.71, $\sigma_y$ = 6.73; $Y'$ = .742X + 11.12; $X'$ = .533Y
   + 2.65; $b_{yx}b_{xy}$ = .3956 = $r^2_{xy}$.
6. $Y'$: 11.9; 14.1; 16.3; 18.5; 20.8; 23.0; 25.2; 27.4.
7. $\sigma_{yx}$ = 5.24; $\sigma_{xy}$ = 4.44; $\sigma^2_{y'}$ = 17.94; $\sigma^2_{yx}$ = 27.35; $r^2_{xy}$ = .3956; $k^2_{xy}$ = .6044.
9. k = .78; E = 22.2; $r^2$ = .396.
10. $M_x$ = 22.9, $M_y$ = 27.7; $X'_6$ = .945$X_5$ + 6.06; $X'_5$ = .651$X_6$ + 4.87; $b_{yx}b_{xy}$ = .6152;
    $r^2$ = .6147.


CHAPTER 16 


MULTIPLE PREDICTION 


MULTIPLE CORRELATION 


Independent and Dependent Variables. Thus far we have been dealing 
with correlations between two things at a time and the prediction of some 
variable Y from another variable X, or vice versa. Actual relationships 
between measured things in psychology and education are by no means so 
simple as that. One variable is found associated with, or dependent upon, 
more than one other variable at the same time. When we can think of some 
variables as being causes of another one, or even when we merely want to 
predict that one from our knowledge of several others that are correlated with 
it, we call the one variable the dependent variable and the ones upon which it 
depends the independent variables. The independent variables are so called 
because we can manipulate them at will or because they vary by the nature of 
things and, in consequence, we expect the dependent variable to vary 
accordingly. 

Whether or not a certain color is liked depends upon several factors: its hue 
(whether yellow, red, or purple, etc.), its lightness (whether light, medium, or 
dark), and its chroma (saturation or density). The affective value of the 
color also depends upon its area, its use, and its background. We are here 
naming independent variables upon which the affective value of a color 
depends. In so far as each one is a determiner of agreeableness of color, it 
will exhibit some correlation individually with affective value. The size of 
any one of these correlations will depend upon the relative strength of that 
factor and also upon how well the other factors have been neutralized, as they 
should be in a good experimental situation. 

A Graphic Picture of Multiple Dependence. The idea of a dependence of 
one variable upon two others can be illustrated by Fig. 16.1. In that illustra- 
tion is shown how the dependent variable, success in pilot training, is related 
both to aptitude scores and to chronological age. It requires a three-dimen- 
sional figure to show the relationships. The vertical dimension represents 
the dependent variable. Here it is measured in terms of percentage of 
graduates—not an ordinary way of measuring, but it will, nevertheless, show 
the principles involved. The two independent variables are shown as sides 

of the base, at right angles to each other. The scale of chronological age is
shown reversed for convenience, since the correlation between age and the 
training criterion was negative. Both independent variables are shown here 
in very coarse categories for the sake of a simpler diagram. 

By noting rows of blocks (left to right) we can see how graduation rate 
changes with age for a relatively constant level of aptitude. By noting the 
columns of blocks (front to back) we can see how graduation rate changes 
with aptitude score for a relatively constant age level. The term constant
covers an unusual range in this illustration, but with finer grouping on age
and aptitude we should expect similar trends. It is obvious that the regressions
for the criterion on aptitude are much steeper than those for the criterion
on age. The difference would be even more apparent if we had the criterion
in terms of a properly graded measurement scale. The correlation between
aptitude scores and the criterion was much higher (approximately .55) than
that between age and the criterion (approximately −.10). A very rough
appreciation of the joint predictive value of aptitude score and age can be
seen by noting the change of height from the lowest block (29.9 per cent) to
the highest (90.0 per cent). This change may be compared with those
changes across columns alone or across rows alone. From this comparison
we should expect better prediction from both independent variables than
from either alone.

FIG. 16.1. A multiple regression with percentage graduating from pilot training as a
function of both chronological age and aptitude score. (Adapted from an unpublished report of
Headquarters, AAF Training Command, Fort Worth, Texas.)

The Coefficient of Multiple Correlation. When we are interested in the 
amount of correlation between a dependent variable and two or more others 
simultaneously, we are dealing with a multiple-correlation problem. The 
coefficient of multiple correlation indicates the strength of relationship 
between one variable and two or more others taken together. The multiple 
correlation is not merely the sum of the correlations of the dependent variable 
and the various independent variables taken separately. Obviously, there 
would be instances in which these would add up to more than 1.00. One 
reason is that independent variables themselves are usually overlapping 
(intercorrelated) and so duplicate one another to some extent. In this we see 
one important principle of multiple correlation. The multiple R is related 
to the intercorrelation of independent variables as well as to their correlation 
with the dependent variable. The interdependency of the determiners sug- 
gested for affective value of colors is probably not so apparent as in the case 
of factors related to achievement in college algebra. Here we can think of 
such predictive factors as intelligence-test scores and high-school marks, 
which, being related, duplicate one another to some extent in predicting
achievement in college algebra. Hours of study and interest also bear much 
in common and so are not completely independent determiners of success in 
algebra. 

A Multiple-correlation Problem. In Table 16.1 are presented some data
that call for the multiple-correlation solution. Four of the variables ($X_2$, $X_3$,
$X_4$, and $X_5$) are all measures of things that supposedly determine academic
success in college freshmen. $X_1$ is the dependent variable, or average freshman
marks. It is customary to designate the dependent variable by $X_1$,
though some authors, less often, call it $X_0$.

TABLE 16.1. INTERCORRELATIONS AMONG FIVE VARIABLES, INCLUDING ONE INDEX
OF SCHOLARSHIP AND FOUR PREDICTIVE INDICES (N = 174)*

$X_2$ = arithmetic test in the Ohio State Psychological Examination, Form 10.
$X_3$ = analogies test in the same examination.
$X_4$ = an average grade in high-school work.
$X_5$ = student interest inquiry (measuring breadth of interest).
$X_1$ = an average grade for the first semester in university.

* These data were abstracted from the Ohio State Coll. Bull. 58, by L. D. Hartson, and have been
used in this chapter by permission.

An examination of Table 16.1 shows that the analogies test and high-school
average mark have the highest correlation, when taken alone, with $X_1$,
whereas the interest score $X_5$ has the lowest. The highest intercorrelations
come between $X_2$, $X_3$, and $X_4$. All represent abilities of one kind or another,
and their correlations with $X_5$ (interests) are generally lower. This gives
promise that the interest scores will contribute something to the prediction of
college marks that will not have been already contributed by the other variables,
and so it should pay to include $X_5$ in the battery of predictive indices.

As a matter of experience in psychological and educational predictions, 
it has been a common finding that it rarely pays to bring into a multiple- 
prediction situation more than four or five independent variables. By 
the time that this many are combined, they have fairly well covered what 
any additional one can do for us. This is partly a consequence of the fact 
that good human qualities tend to go together (to be intercorrelated) and 
partly that our predictive indices tend to remain in the same area of abilities, 
also ignoring personality factors, physical factors, and external circumstances. 

The Solution of a Three-variable Problem. We first take the simplest 
case of multiple correlation, that between the dependent variable and two 
independent variables. In the general problem given by the data in Table 
16.1, we may ask what is the correlation between freshman marks on the one 
hand and the two variables analogies-test scores and high-school averages on 
the other. The simplest general formula for this case is 


$$R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2r_{12}r_{13}r_{23}}{1 - r_{23}^2} \qquad \text{(Square of coefficient of multiple correlation with three variables)} \tag{16.1}$$

where $R_{1.23}$ = coefficient of multiple correlation between $X_1$ and a combination
of $X_2$ and $X_3$.

Be sure to notice that this formula merely gives us $R^2$, the square root of which
is R.

The immediate example we have set for ourselves is to find $R_{1.34}$ rather than
$R_{1.23}$. To use formula (16.1), we need merely to substitute the subscripts 3
and 4 for 2 and 3. The solution is

$$R^2_{1.34} = \frac{(.583)^2 + (.546)^2 - 2(.583)(.546)(.396)}{1 - (.396)^2} = \frac{.339889 + .298116 - .252108}{1 - .156816} = .45766$$

$$R_{1.34} = .677$$
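The arithmetic of formula (16.1) can be checked with a few lines of Python. This is only a sketch; the function name is hypothetical, and the three correlations are the ones substituted in the solution above.

```python
import math


def multiple_r_squared(r12, r13, r23):
    """Formula (16.1): squared multiple correlation of X1 with X2 and X3."""
    numerator = r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23
    return numerator / (1 - r23 ** 2)


# Correlations used in the text: r13 = .583, r14 = .546, r34 = .396.
r_squared = multiple_r_squared(0.583, 0.546, 0.396)
multiple_r = math.sqrt(r_squared)   # R itself is the square root of R squared
print(round(r_squared, 4), round(multiple_r, 3))
```

Because the formula is symmetric in its first two arguments, the order in which the two validity coefficients are supplied does not matter.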


The Multiple-regression Equation. We also have here a prediction problem
of estimating $X_1$ values from both $X_3$ and $X_4$. This calls for a regression
equation that involves all three variables, in other words, a multiple-regression
equation. From such an equation, we can predict an $X_1$ value for every
individual. The correlation between these predicted values ($X'_1$) and the
obtained ones ($X_1$) would be .677. This is another interpretation of a
coefficient of multiple correlation.

For the three-variable problem, the regression equation has the general
form $X'_1 = a + b_{12.3}X_2 + b_{13.2}X_3$. As in previous regression equations, the
coefficient a is a constant and must be calculated from the data. Its function
is to assure that the mean of the $X'_1$ values coincides with the mean of the $X_1$
values. The b coefficients serve the same purpose here as in the simple,
two-variable equation. The coefficient $b_{12.3}$ is the multiplying constant, or
weight, for the $X_2$ values, and $b_{13.2}$ is the weight for the $X_3$ values. The value
of $b_{12.3}$ tells how many units $X'_1$ increases for every unit increase in $X_2$, when
the effects of $X_3$ have been nullified or held constant. The value of $b_{13.2}$ tells
how many units $X'_1$ increases for every unit increase in $X_3$, with the effects of
$X_2$ removed from consideration.

The particular b weights, as computed by the formulas given below, are the
optimal weights. They assure the maximum correlation between predicted
and obtained values. The solution, with the obtained b weights, satisfies the
principle of least squares in that the sum of the squares of discrepancies
between the $X_1$ values and the $X'_1$ values will be a minimum.

Solution of the b Coefficients. We do not find the b coefficients directly from
the correlations but do so indirectly through the so-called beta coefficients.
Beta coefficients are called standard partial regression coefficients: standard,
because they would apply if standard measures were used in all variables;
partial, because, as in the case of the coefficient of partial correlation (see
Chap. 13), the effects of other variables are held constant. The $b_{12.3}$ and $b_{13.2}$
are known as partial regression coefficients, because they, too, are weights that
presuppose that other independent variables are held constant. They are
given by the formulas


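$$b_{12.3} = \left(\frac{\sigma_1}{\sigma_2}\right)\beta_{12.3} \tag{16.2a}$$

and (Partial regression coefficients)

$$b_{13.2} = \left(\frac{\sigma_1}{\sigma_3}\right)\beta_{13.2} \tag{16.2b}$$

The betas are found by the formulas

$$\beta_{12.3} = \frac{r_{12} - r_{13}r_{23}}{1 - r_{23}^2} \tag{16.3a}$$

and (Standard partial regression coefficients)

$$\beta_{13.2} = \frac{r_{13} - r_{12}r_{23}}{1 - r_{23}^2} \tag{16.3b}$$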
Similar equations apply, with change of subscripts, when the independent
variables are $X_3$ and $X_4$ instead of $X_2$ and $X_3$. In our example

$$\beta_{13.4} = \frac{.583 - (.546)(.396)}{1 - (.396)^2} = .435$$

and

$$\beta_{14.3} = \frac{.546 - (.583)(.396)}{1 - (.396)^2} = .374$$


We can now solve for the b coefficients by means of formulas (16.2a) and
(16.2b):

$$b_{13.4} = \frac{9.1}{\sigma_3}(.435) = .233$$

$$b_{14.3} = \frac{9.1}{\sigma_4}(.374) = .175$$


For the complete regression equation, the a coefficient is still lacking. It is
given by the general formula

$$a = M_1 - b_{12.3}M_2 - b_{13.2}M_3 \tag{16.4}$$

Inserting the known values

$$a = 73.8 - (.233)(49.5) - (.175)(61.1) = 51.58$$

The complete regression equation will then read

$$X'_1 = 51.58 + .233X_3 + .175X_4$$

To interpret the equation, we may say that for every unit increase in $X_3$,
$X'_1$ is increasing .233 unit and that for every unit increase in $X_4$, $X'_1$ is
increasing .175 unit. To apply the equation to a particular student whose $X_3$ score
is 25 and whose $X_4$ score is 32, we predict that his $X_1$ score will be

$$X'_1 = 51.58 + 5.82 + 5.60 = 63.00$$
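The chain of steps from correlations to a single prediction can be sketched in Python as follows. The function names are hypothetical. The b weights are entered as reported in the text rather than recomputed, and the constant a is computed at full precision, so it may differ from the printed 51.58 in the last decimal place.

```python
def beta_weights(r12, r13, r23):
    """Formulas (16.3a) and (16.3b): standard partial regression coefficients."""
    denominator = 1.0 - r23 ** 2
    beta_12 = (r12 - r13 * r23) / denominator
    beta_13 = (r13 - r12 * r23) / denominator
    return beta_12, beta_13


def intercept(m1, b2, m2, b3, m3):
    """Formula (16.4): a = M1 - b2*M2 - b3*M3."""
    return m1 - b2 * m2 - b3 * m3


def predict(a, b2, x2, b3, x3):
    """Predicted criterion score from the two weighted predictors."""
    return a + b2 * x2 + b3 * x3


# Correlations and means from the worked example (predictors X3 and X4).
beta_3, beta_4 = beta_weights(0.583, 0.546, 0.396)
b_3, b_4 = 0.233, 0.175                      # b weights as reported in the text
a = intercept(73.8, b_3, 49.5, b_4, 61.1)
predicted = predict(a, b_3, 25, b_4, 32)     # student with X3 = 25, X4 = 32
print(round(beta_3, 3), round(beta_4, 3), round(a, 2), round(predicted, 1))
```

The prediction for this student agrees with the value of 63 found above to within rounding.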


We use $X'_1$ to stand for his predicted average freshman mark, because he has
an actual average mark that we call $X_1$. Some other examples of individual
students are presented in Table 16.2 to show how various combinations of
values for $X_3$ and $X_4$ point to corresponding values of $X'_1$.

TABLE 16.2. SOME PREDICTIONS OF SCHOLARSHIP MARK FROM MEASURES IN
TWO VARIABLES

Multiple Predictions by a Graphic Method. A graphic method of making
predictions of scores in $X_1$ from different combinations of scores in $X_3$ and $X_4$
is shown in Fig. 16.2. The chart is drawn to apply to the prediction of average
freshman grades from scores in the analogies test and high-school average.
Diagonal lines are drawn in the figure, each representing the locus of identical
predicted values. These lines represent $X'_1$ scores at intervals of 5 units.
Note, for example, the line for $X'_1$ = 70. A prediction of 70 may arise from
many different combinations of $X_3$ and $X_4$. Choose several values, in turn,
in the analogies test, for example, 10, 30, 50, and 70. Corresponding values
in high-school average needed to yield predictions of 70 are 92, 65, 38, and 12,
respectively. The chief use of the chart, however, is to find $X'_1$ for any given
values in $X_3$ and $X_4$. For an $X_3$ of 20 and an $X_4$ of 50, the prediction is
exactly 65. For an $X_3$ of 90 and an $X_4$ of 14, the prediction is exactly 75.
When the prediction is not exactly on one of the diagonal lines, we interpolate,
by inspection, between two lines. Thus, for $X_3$ = 40 and $X_4$ = 70, the
most probable $X'_1$ is 73. The proportion of the distance between two
lines must be estimated by the perpendicular distance between them. The
perpendicular is in a diagonal direction. The reader may get further practice
in using the chart by verifying the predictions found by computation in
Table 16.2.

FIG. 16.2. Diagram showing constant values in the dependent variable for different
combinations of scores in two independent variables, each weighted as called for by the
multiple-regression equation. (Horizontal axis: score in $X_3$, analogies test; vertical axis:
score in $X_4$, high-school average.)
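Each diagonal line of such a chart is simply the regression equation solved for one predictor at a fixed predicted value. A minimal Python sketch follows, assuming the rounded constants of the equation given earlier (51.58, .233, .175); the function name is hypothetical. Values read from a chart will agree with the computed ones only to within about a point.

```python
def x4_needed(target, x3, a=51.58, b3=0.233, b4=0.175):
    """High-school average (X4) that yields the target prediction for a given X3."""
    return (target - a - b3 * x3) / b4


# Points on the line of constant prediction 70, for four analogies scores.
analogies_scores = (10, 30, 50, 70)
needed = [x4_needed(70, x3) for x3 in analogies_scores]
chart_readings = (92, 65, 38, 12)   # values quoted in the text
for x3, computed, read in zip(analogies_scores, needed, chart_readings):
    print(x3, round(computed, 1), read)
```

Since $X_4$ falls by a constant amount for each unit of $X_3$ along a line, the loci are straight and parallel, which is what makes interpolation by inspection workable.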

Calculating the Multiple R from Beta Coefficients. If the beta coefficients 
are known, the shortest route to the multiple R is by way of the equation 


$$R^2_{1.23} = \beta_{12.3}r_{12} + \beta_{13.2}r_{13} \tag{16.5}$$

Again, note that this gives $R^2$, from which the square root must be obtained. For
the scholarship data and variables $X_3$ and $X_4$,

$$R^2_{1.34} = (.435)(.583) + (.374)(.546) = .457809$$

$$R_{1.34} = .677$$

as was found by formula (16.1) previously. 

Interpretation of a Multiple R. Once computed, a multiple R is subject to
the same kinds of interpretation, as to size and importance, as were described
for a simple r. One kind of interpretation is in terms of $R^2$, which we call the
coefficient of multiple determination. This tells us the proportion of variance
in $X_1$ that is dependent upon, or associated with, or predicted by $X_3$ and $X_4$
combined with the regression weights used. In this case, $R^2$ is .4578, and we
can say that 45.78 per cent of the variance in freshman marks is accounted
for by whatever is measured by the analogies test and by high-school marks
taken together, eliminating from double consideration things that they have
in common. The remaining percentage of the variance, which is 54.22
($1 - R^2$), is still to be accounted for. This remainder is given the symbol $K^2$
and is known as the coefficient of multiple nondetermination. This is consistent
with the fact that $R^2 + K^2 = 1.0$, just as $r^2 + k^2 = 1.0$ in the simple
correlation problem.

Relative Contribution of Independent Variables. Since the coefficient of
multiple determination, or $R^2$, is composed of the two components in formula
(16.5) and since each component pertains to only one of the independent
variables, it is permissible to take each component as indicating the contribution
of one independent variable to the total predicted variance of $X_1$.
This being the case, the first term, .253605, indicates the contribution to
freshman scholarship by ability in the analogies test, and the second term,
.204204, indicates the contribution of the high-school average. Rounded, in
terms of percentages, these are 25.4 and 20.4, respectively. This enables us
to obtain a more definite idea of the relative importance of each variable in
the regression equation. We can say that ability in the analogies test, with
what it has in common with high-school scholarship held constant, contributes
about 25 per cent to freshman scholarship and that high-school marks, apart
from that portion related to analogies-test ability, contribute about 20 per
cent. We cannot take these as final or absolute, for there are other factors
contributing to freshman scholarship level that have not been similarly
eliminated from consideration. But it is of much value to be able to compare
contributions of variables to outcomes in this manner.
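The two components of formula (16.5) and their sum can be reproduced with a short Python sketch. The function name is hypothetical; the betas and correlations are those of the worked example.

```python
def variance_contributions(betas, validities):
    """Components of formula (16.5): one beta-times-r product per predictor."""
    return [beta * r for beta, r in zip(betas, validities)]


# Betas and correlations with X1 for the analogies test and high-school average.
parts = variance_contributions([0.435, 0.374], [0.583, 0.546])
r_squared = sum(parts)                                 # coefficient of multiple determination
percentages = [round(100 * part, 1) for part in parts]
nondetermination = 1 - r_squared                       # K squared
print(percentages, round(r_squared, 6), round(nondetermination, 4))
```

The same function extends to any number of predictors, since formula (16.5) adds one such product for each independent variable.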

The Standard Error of Estimate from Multiple Predictions. The standard
error of estimate is again brought in to indicate about how far the predicted
values would deviate from the obtained ones. The formula is the same as
previously, except that the multiple R is substituted for r. It now reads

$$\sigma_{1.23} = \sigma_1\sqrt{1 - R^2_{1.23}} \qquad \text{(Standard error of multiple estimate)} \tag{16.6}$$

In the illustrative problem,

$$\sigma_{1.34} = 9.1\sqrt{1 - .457809} = 6.7$$

We can now say that two-thirds of the obtained $X_1$ values will lie within 6.7
points of the predicted $X'_1$ values. The margin of error with knowledge of $X_3$
and $X_4$ is 73.6 per cent as great as the margin of error would be without that
knowledge. These conclusions presuppose predictions made on the basis of
the regression equation that was obtained, and predictions made for individuals
belonging to the population and sampled at random.

The index of forecasting efficiency may also be used by way of interpreta- 
tion and, because of its close relation to the standard error of estimate, may 
be mentioned at this point. The formula is the same as for a Pearson r [see 
formula (15.22)]. In the example of our three variables, E = 26.4 per cent, 
which means that predictions by means of the equation are 26.4 per cent 
better than those made merely from a knowledge of the mean of the $X_1$ values.
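Both figures can be verified with a brief Python sketch. The function names are hypothetical, and the expression used for the index of forecasting efficiency, E = 100(1 − √(1 − R²)), is an assumption about the form of formula (15.22), which is not reproduced here; it does agree with the 26.4 per cent quoted above.

```python
import math


def standard_error_of_estimate(sigma_1, r_squared):
    """Formula (16.6): sigma_1 times the square root of (1 - R squared)."""
    return sigma_1 * math.sqrt(1.0 - r_squared)


def forecasting_efficiency(r_squared):
    """Assumed form of the index E: percentage reduction in the margin of error."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r_squared))


se_estimate = standard_error_of_estimate(9.1, 0.457809)
efficiency = forecasting_efficiency(0.457809)
relative_error = 100.0 - efficiency   # margin of error as a percentage of sigma_1
print(round(se_estimate, 1), round(efficiency, 1), round(relative_error, 1))
```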

Multiple Correlation in Small Samples. For small samples—and for 
multiple-correlation problems this means anything less than an N of 100— 
degrees of freedom should be considered in dealing with questions of sampling. 
If the multiple R and the other statistics derived from it are to be used for 
estimating population parameters, there is even more bias than for a simple 
correlation problem. 

It was stated earlier that the multiple R represents the maximum correla- 
tion between a dependent variable and a weighted combination of independ- 
ent variables. The least-square solution that is represented in computing 
the combined weights assures this result; but it really assures too much. It 
capitalizes upon any chance deviations that favor high multiple correlation. 
The multiple R is therefore an inflated value. It is a biased estimate of the 
multiple correlation in the population. If we were to apply the regression
weights in a new sample and to correlate predicted $X_1$ values with obtained
$X_1$ values, we should probably find that the correlation would be smaller
than R.

It is desirable, therefore, to find some means of estimating a parameter R
which gives a more realistic picture of the general situation. A common way
of "shrinking" R to a more probable population value is by the formula

$${}_cR^2 = 1 - (1 - R^2)\left(\frac{N - 1}{N - m}\right) \qquad \text{(Correction in R for bias)} \tag{16.7}$$

where N = number of cases in the sample correlated
m = number of variables correlated
N − m = number of degrees of freedom, one degree being lost for each
mean, there being one mean per variable

For the illustrative problem above, where R = .677, the corrected $R^2$ would be

$${}_cR^2 = 1 - (1 - .4579)\left(\frac{174 - 1}{174 - 3}\right) = .4515$$

from which ${}_cR$ = .672. The correction does not make much difference here
because the sample was fairly large and the number of variables small. There
are problems in which the change would be very appreciable.
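The effect of sample size on the correction can be seen from a small Python sketch of the shrinkage formula (the function name is hypothetical). The second call uses an invented sample of only 20 cases to show how much larger the shrinkage becomes.

```python
import math


def shrunken_r_squared(r_squared, n_cases, n_variables):
    """Shrinkage correction: 1 - (1 - R^2)(N - 1)/(N - m)."""
    return 1.0 - (1.0 - r_squared) * (n_cases - 1) / (n_cases - n_variables)


# The illustrative problem: R squared = .4579, N = 174, m = 3 variables.
corrected = shrunken_r_squared(0.4579, 174, 3)
corrected_r = math.sqrt(corrected)

# Hypothetical small sample with the same obtained R squared.
corrected_small = shrunken_r_squared(0.4579, 20, 3)
print(round(corrected, 3), round(corrected_r, 3), round(corrected_small, 3))
```

With N = 174 the corrected R is about .672, as in the text; with the assumed N of 20 the corrected $R^2$ drops to roughly .39.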

A similar correction is necessary for the standard error of estimate, unless
${}_cR$ has been used in formula (16.6). The general formula is

$${}_c\sigma_{1.23 \ldots m} = \sigma_{1.23 \ldots m}\sqrt{\frac{N - 1}{N - m}} = \sigma_1\sqrt{(1 - R^2)\,\frac{N - 1}{N - m}} \qquad \text{(General correction of a multiple standard error of estimate for bias)} \tag{16.8}$$

where the symbols N and m are as defined above. This correction also
makes the greatest difference when N is small and m is large.

Sampling Errors in Multiple-correlation Problems. For an R derived
from any number of variables, the standard error is

$$\sigma_R = \frac{1 - R^2}{\sqrt{N - m}} \qquad \text{(Standard error of a multiple R)} \tag{16.9}$$

in which N − m represents the number of degrees of freedom. Unless N is
very large, and much larger than m, this formula underestimates the amount
of sampling error. $\sigma_R$ is subject to the same limitations as $\sigma_r$, even more so.
There is no z transformation that applies to R.

When the null hypothesis is to be tested, Table D is most convenient. The
R's meeting the 5 per cent and 1 per cent levels of significance are shown in
columns headed by numbers of variables and rows headed by appropriate
numbers of df. In the illustrative problem, N = 174, so the number of
degrees of freedom is 171. The standard error of R is .041. The obtained R
cannot very well be more than .11 from the population value of R (.11 being
about 2.58 times $\sigma_R$). From Table D we find that with 150 degrees of freedom
(the next lower and nearest to 171) and with three variables, an R of
.198 is significant at the 5 per cent level and one of .244 at the 1 per cent level.
We should have little room for doubt that a genuine multiple correlation
exists in the population.
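The figures in this paragraph can be checked with a short Python sketch, assuming the standard error of R takes the form $(1 - R^2)/\sqrt{N - m}$. The function name is hypothetical; the two critical values are the Table D entries quoted above, not computed quantities.

```python
import math


def standard_error_of_r(r_squared, n_cases, n_variables):
    """Standard error of a multiple R: (1 - R^2) / sqrt(N - m)."""
    return (1.0 - r_squared) / math.sqrt(n_cases - n_variables)


obtained_r = 0.677
se_r = standard_error_of_r(0.457809, 174, 3)
margin = 2.58 * se_r                    # distance cited in the text as about .11

# Critical values quoted from Table D for 150 df and three variables.
critical_5_percent, critical_1_percent = 0.198, 0.244
significant_at_1_percent = obtained_r > critical_1_percent
print(round(se_r, 3), round(margin, 2), significant_at_1_percent)
```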

Standard Error of a Multiple-regression Coefficient. For the beta coefficient 
the standard error is estimated by the formula 


$$\sigma^2_{\beta_{12.34 \ldots m}} = \frac{1 - R^2_{1.234 \ldots m}}{(1 - R^2_{2.34 \ldots m})(N - m)} \qquad \text{(Standard error of a beta coefficient)} \tag{16.10a}$$

The new symbol here is $R_{2.34 \ldots m}$, which is a multiple R with $X_2$ as the dependent
variable and all other variables except $X_1$ as independent variables.
There would be one of these standard errors for each of the independent
variables in turn, each being substituted for $X_2$. For a three-variable problem,
the R in the denominator reduces to $r_{23}$. Note that this formula gives
the variance error, i.e., $\sigma^2$.
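For the three-variable scholarship problem the computation can be sketched in Python as below. This is an illustration only: it assumes the variance error has the form $(1 - R^2_{1.23})/[(1 - r^2_{23})(N - m)]$, the function name is hypothetical, and the result is not a figure reported in the text.

```python
import math


def beta_standard_error(r2_full, r2_among_predictors, n_cases, n_variables):
    """Square root of the variance error of a beta coefficient.

    r2_full: squared multiple R of the criterion on all predictors.
    r2_among_predictors: squared (multiple) correlation of this predictor
        with the remaining predictors; r23 squared when there are only two.
    """
    variance = (1.0 - r2_full) / ((1.0 - r2_among_predictors) * (n_cases - n_variables))
    return math.sqrt(variance)


# Scholarship example: R squared = .457809, r34 = .396, N = 174, m = 3.
se_beta = beta_standard_error(0.457809, 0.396 ** 2, 174, 3)
ratio = 0.435 / se_beta      # beta for the analogies test over its standard error
print(round(se_beta, 3), round(ratio, 1))
```

Under these assumptions the beta of .435 is several times its standard error, so a null hypothesis of zero weight would be hard to maintain.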

For the b coefficient, the standard error is estimated by

$$\sigma_{b_{12.34 \ldots m}} = \frac{\sigma_{1.234 \ldots m}}{\sigma_{2.34 \ldots m}\sqrt{N - m}} \qquad \text{(Standard error of a b coefficient)} \tag{16.10b}$$

Needed in the denominator for each independent variable in turn is the standard
error of estimate of that variable from all other independent variables.
Beyond a three-variable problem this becomes quite laborious, but in the
latter the denominator term reduces to $\sigma_{2.3}$. Unlike the preceding formula,
this gives the standard error without extracting a square root after it is solved.

The chief use of these standard errors is to test the null hypothesis, to determine
whether each independent variable has anything at all to contribute to
prediction when its relation to other variables is taken into account. If the
obtained beta or b is not significantly different from zero, that variable might
well be dropped from the regression equation, and a new equation derived.

Significance of a Difference between Multiple R's. We often want to know
whether the multiple R with more independent variables included is significantly
greater than the R with a smaller number of variables. There is available
an F test for such a difference. The formula for computing F for this
purpose reads

$$F = \frac{(R_1^2 - R_2^2)(N - m_1 - 1)}{(1 - R_1^2)(m_1 - m_2)} \tag{16.11}$$

where $R_1$ = multiple R with larger number of independent variables
$R_2$ = multiple R with one or more variables omitted
$m_1$ = larger number of independent variables
$m_2$ = smaller number of independent variables

In the use of the F tables, the $df_1$ degrees of freedom are given by ($m_1 - m_2$)
and the $df_2$ degrees of freedom by ($N - m_1 - 1$).
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Some PRINCIPLES ОР MULTIPLE CORRELATION 


While multiple-correlation problems may be extended to any number of 
variables, before we consider the solution with more than three, it is desirable 
to examine some of the general principles that apply for any number of varia- 
bles but which can be seen more clearly when there are only three. 

The two main principles are (1) a multiple correlation increases as the size 
of correlations between dependent and independent variables increases and 
(2) a multiple correlation increases as the size of intercorrelations of 
independent variables decreases. A maximum R will be obtained when the 
correlations with X1 are large and when intercorrelations of X2, X3, . . . , Xm are 
small. In building a battery of tests to predict a criterion, test makers have 
usually tried to maximize the validity of each test and to minimize the corre- 
lations between tests. There are limitations to the application of these 
objectives, however, and in practice they tend to conflict, as we shall see. 
There are also apparent exceptions to the rules, as examples will show. The 
whole story is not told by the two principles as stated. 


TABLE 16.3. EXAMPLES OF MULTIPLE CORRELATIONS IN A THREE-VARIABLE PROBLEM 
WHEN INTERCORRELATIONS VARY 


Some Typical Combinations of r12, r13, and r23. Table 16.3 provides some 
examples of various combinations of correlations among three variables that 
enter into a multiple-correlation problem. The mathematically wise student 
will be able to predict the kind of outcome in each instance, from a general 
inspection of formula (16.1). Repeated here for ready reference, it reads 

\[
R^2_{1.23} = \frac{r_{12}^2 + r_{13}^2 - 2 r_{12} r_{13} r_{23}}{1 - r_{23}^2} \qquad (16.1)
\]

If the correlation r23 is zero, the third term in the numerator is zero, which 
has a tendency to make R1.23 larger. On the other hand, there is a distinct 
advantage in having r23 very large, because of its role in the denominator. 
If r23 approaches 1.0, the denominator approaches zero. Even though the 
numerator may become small, under these conditions R could be quite large. 
A large R is thus favored by having r23 either very small or very large. This 
principle should be added to the two mentioned above. But it should be 
said also that a large r23 is more effective when the independent variables are 
unequally correlated with the dependent variable, and particularly when one 
of the correlations is very small. 

Note the first example in Table 16.3, in which r23 = .0. For this event, 
formula (16.1) reduces to 

\[
R^2_{1.23} = r_{12}^2 + r_{13}^2 \qquad (16.12)
\]

In other words, when independent variables are not correlated, the proportion 
of variance predicted by their combination is equal to the sum of the propor- 
tions of variance predicted by each separately. This holds for any number 
of independent variables whose intercorrelations are zero. A psychological 
interpretation of this is that when intercorrelations among predictive meas- 
ures are zero, the total contribution of each to the prediction of a complex 
criterion containing all the things predicted is unique. 
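
Formula (16.1) is easy to evaluate by machine. The following is a minimal sketch, with a hypothetical function name, that applies it to the combinations of r12, r13, and r23 taken up in this section (examples 1 to 9); it is offered only as a way of checking the R values quoted in the discussion.

```python
import math

def multiple_r_three(r12, r13, r23):
    """Multiple R of X1 on X2 and X3, from formula (16.1)."""
    r_squared = (r12**2 + r13**2 - 2 * r12 * r13 * r23) / (1 - r23**2)
    return math.sqrt(r_squared)

# (r12, r13, r23) for examples 1 to 9 as described in the text
examples = [
    (0.4, 0.4, 0.0), (0.4, 0.4, 0.4), (0.4, 0.4, 0.9),
    (0.4, 0.2, 0.0), (0.4, 0.2, 0.4), (0.4, 0.2, 0.9),
    (0.4, 0.0, 0.0), (0.4, 0.0, 0.4), (0.4, 0.0, 0.9),
]

for number, (r12, r13, r23) in enumerate(examples, start=1):
    print(number, r12, r13, r23, f"{multiple_r_three(r12, r13, r23):.2f}")
```

With r23 = .0 the function reduces to the square root of the sum of the two squared validities, which is formula (16.12).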

Note next the second and third examples and compare them with the first. 
In all three, the r12 and r13 correlations remain constant at .4, while r23 
increases first to .4, then to .9. As this happens, R goes from .57 to .48 to .41. 
In the last instance r23 is so high that there is practically no gain from 
combining the two variables X2 and X3. We shall see a modified result in the 
next three examples. 

In examples 4 to 6, r12 remains constant at .4 and r13 constant at .2, while 
r23 varies from .0 to .9. In the first of these three we find formula (16.12) 
verified. The two variances sum to .2000 and R is .45. As r23 increases to 
.4, R shrinks back to approximately .40. Thus we can conclude that if one 
test has a validity of .4, it may pay to add to it another with a validity of only 
.2, provided the two tests intercorrelate zero. But if there is an appreciable 
correlation between them, even a moderate one, it would not pay. 
What happens if we increase r23 still more? When it is as high as .9, R jumps 
to .54. This supports the third principle stated above: that r23 should be 
either very low or very high. One may ask why this principle did not appear 
to work in the first three examples. The answer is that it was obscured by 
the relation of r12 and r13. In those examples r12 equaled r13, and in the next 
three examples these correlations were unequal. A better explanation is that 
one of them is very small. One may well ask what psychological meaning is 
involved in the increase in R when r23 is very large. This is best explained in 
connection with the next three examples. 

In examples 7 to 9, r12 and r13 are still more uneven in size. They also have 
special interest because r13 = .0 in all three, while r23 varies from .0 to .4 to .9, 
as in the previous groups of three examples. It would seem, at first thought, 
that any test that correlates zero with a criterion would have no value in 
predicting that criterion. It is true that alone it has no value whatever for 
doing so. But it is not true if that test is combined with other tests with 
which it correlates. In example 7, the common-sense expectation is 
vindicated. The addition of an invalid test would offer no improvement. It 
would simply receive a regression weight of zero, which means it would not be 
included in the regression equation. But note that when r23 is increased to 
.4, R becomes .44, and when r23 is .9, R becomes .92. Clearly a test with zero 
validity may add materially to prediction if it correlates substantially with 
another test that is valid. 

Suppression Variables. The psychological significance of this is best 
explained by factor theory (see Chap. 18). Roughly, the answer is that 
variable X2, in spite of its positive correlation with X1, has some variance in 
it that correlates either zero, or perhaps even negatively, with the criterion. 
This same variance prevents X2 from correlating as highly as it might with 
X1. Variable X3 correlates with X2 because they have in common that 
variance not shared by X1. In this kind of situation we find that X3 acquires 
a negative regression weight, although it may correlate only zero, and not 
negatively, with the criterion. We call such a variable a suppression variable. 
Its function in a regression equation is to suppress whatever variance in other 
independent variables may not be represented in the criterion but which may 
be in some variable that does otherwise correlate with the criterion. 

An example of this came to the author's attention in testing for pilot selec- 
tion. It was a consistent finding that a vocabulary test, which is as pure a 
measure of the verbal-comprehension factor as we have, correlated zero or 
even slightly negative with the criterion of success in pilot training. The 
same kind of test correlated substantially with a reading-comprehension test 
which also correlated positively with the pilot criterion. The reading test 
correlated positively with the criterion because it measured, besides verbal 
comprehension, such factors as mechanical experience and visualization which 
were also component variances in the pilot criterion. The combination of a 
vocabulary test with the reading test, with a negative weight for the vocabu- 
lary test, would have improved predictions over those possible with the read- 
ing test alone. 
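
The negative weight of a suppression variable can be seen numerically. The sketch below uses the usual two-predictor solution of the normal equations for the beta weights (the function name is hypothetical) and applies it to the correlations of example 9, where X3 has zero validity but correlates .9 with X2. These are the hypothetical values of that example, not the pilot-selection data.

```python
import math

def betas_three(r12, r13, r23):
    """Standardized weights for predicting X1 from X2 and X3."""
    denominator = 1 - r23**2
    beta12 = (r12 - r13 * r23) / denominator
    beta13 = (r13 - r12 * r23) / denominator
    return beta12, beta13

r12, r13, r23 = 0.4, 0.0, 0.9
beta12, beta13 = betas_three(r12, r13, r23)

# R squared is the sum of each beta times its validity.
multiple_r = math.sqrt(beta12 * r12 + beta13 * r13)
print(round(beta12, 3), round(beta13, 3), round(multiple_r, 2))
```

The weight for X3 comes out negative even though X3 correlates zero with the criterion, and the weight for X2 is correspondingly enlarged.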

The examples mentioned thus far have had only positive correlations 
involved. In most practice where human variables are measured we have 
only zero or positive correlations, if all measurement scales are aligned so that 
“good” qualities are given high numerical values. Where genuinely negative 
relationships do occur they are likely to be very small. Examples 10 and 11 
in Table 16.3 are given more for their academic than for their practical 
interest. Example 10 should be compared with examples 4, 5, and 6. They 
differ only in the value of r23. When r23 becomes negative, we see that the 
increase that occurred when r23 approaches zero appears to continue as r23 
becomes increasingly negative. When r23 is −.4, R is even greater than when 
r23 is .9. It is doubtful whether situations like example 10 occur in nature, 
though they are theoretically possible. The trend could not go too far, 
however, for with r23 large enough in the negative direction we should come 
to a multiple R greater than 1.0, which would mean an impossible situation, 
even mathematically. 

Example 11 has two negative correlations, r13 and r23. These simply mean 
that variable X3 probably has a reversed scale, for X3 is related to both X1 
and X2 in the same direction. Note that the multiple R is the same as if 
both r13 and r23 were positive and of the same size numerically (example 2). 

Multiple-R Principles in Larger Batteries. The principles illustrated 
above for the three-variable problems also apply in larger combinations of 
variables. The first two principles can be well illustrated by taking other 
hypothetical examples like those in Table 16.4. There we have a demonstra- 
tion of how multiple R’s behave as the number of independent variables 
increases from 2 to 20 and as intercorrelations increase from .0 to .6. 


TABLE 16.4. MULTIPLE CORRELATIONS FROM DIFFERENT NUMBERS OF INDEPENDENT 
VARIABLES EACH CORRELATING .30 WITH THE DEPENDENT VARIABLE BUT WITH 
INTERCORRELATIONS VARYING* 


Number of                  Intercorrelations 
independent 
variables          .00       .10       .30       .60 

     1            (.30)     (.30)     (.30)     (.30) 
     2             .42       .40       .37       .34 
     4             .60       .53       .44       .36 
     9             .90       .67       .48       .37 
    20             --        .79       .52       .38 


* Adapted from Thorndike, R. L. Research Problems and Techniques. AAF Aviation 
Psychology Research Program Reports, No. 3. Washington, D.C.: GPO, 1947. 


Following Thorndike’s choices, we shall assume that each variable corre- 
lates with a criterion to the extent of .3. This is a rather low validity coeffi- 
cient, and about the lower limit of usefulness for a single test or other predic- 
tive device. We shall see, however, how valuable such instruments may be 
when combined in a battery, provided their intercorrelations are not too high. 

In the second row of Table 16.4, when two such tests are combined, we see 
how the multiple R decreases from .42 when r23 is zero to .34 when r23 is .60. 
In each row the same expected phenomenon occurs: a decrease in R as 
intercorrelations increase. Inspection of the columns shows how R increases as 
we add more tests of the same kind to the battery and how the gain in R 
continues up to a battery of 20, except for the case of zero intercorrelations, for 
which the limit of R = 1.0 was passed when the number of tests exceeded 11. 
In this situation (intercorrelations zero) the principle of formula (16.12) still 
applies. The proportion of predicted variance contributed by each test 
would be .09, and 11 tests would yield a multiple R of .995. In other columns 
the increases of R are less drastic, but except in the last column, and perhaps 
in the one preceding, it would apparently pay to continue adding new tests 
until the 20 were included. Matters of administrative effort would have to be 
balanced against gains in R. 
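
The pattern in Table 16.4 can be explored with a short calculation. The sketch below assumes the standard closed form for a battery in which all k tests have the same validity and the same intercorrelation, R² = k r² / [1 + (k − 1) ρ]. That the table was computed in this way is an assumption, and the function name is hypothetical; the values it produces agree with the tabled entries to within about .01.

```python
import math

def battery_r_squared(k, validity, intercorrelation):
    """R squared for k predictors sharing one validity and one intercorrelation."""
    return k * validity**2 / (1 + (k - 1) * intercorrelation)

for k in (2, 4, 9, 20):
    cells = []
    for rho in (0.0, 0.1, 0.3, 0.6):
        r2 = battery_r_squared(k, 0.30, rho)
        # An R squared above 1 signals an impossible set of correlations.
        cells.append("  --" if r2 > 1 else f"{math.sqrt(r2):.2f}")
    print(f"{k:2d}", " ".join(cells))
```

Setting the intercorrelation to zero and k to 11 gives an R squared of .99, the limiting case mentioned above.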

Table 16.4 tells an even more important story. The value of having zero 
intercorrelation among tests in a battery is obvious. If one tries to achieve 
zero intercorrelations among tests, each test measuring a unique factor, how- 
ever, he will often find that each test tends to correlate low with the criterion. 
This is because a practical criterion, of training achievement or of job per- 
formance, is usually a complex variable; it has a number of component vari- 
ances, each component being a common factor (see Chap. 18). If one tries to 
increase the correlation of a test with a criterion, the result is almost invari- 
ably to increase the factorial complexity of the test, to bring in more different 
factor variances. This automatically raises the correlation of this test with 
other tests, because they have more factors in common. This is the reason 
that in practice the two principles mentioned first lead to conflicting objectives. 

Where there has to be a choice, it seems wisest to give less attention to the 
first principle (of maximizing correlation of each test with the criterion) and 
greater attention to the second (of minimizing intercorrelations). If there 
are 20 independent factors represented in a practical criterion, and if each is 
of equal importance, each would contribute .05 of the total variance. Each 
test, measuring only one of the factors, would need to correlate only √.05, 
which is .224, with the criterion. In this case, raising the correlation between 
any one test and the criterion would be of little use. There would be no 
objection to a higher correlation. Appropriate weighting would bring the 
test's contribution to prediction down to required proportions. Thus, it can 
be concluded that low correlations of tests with practical criteria can be 
tolerated, provided we can combine enough tests in a battery and provided 
their intercorrelations are near zero. 
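
The arithmetic behind this conclusion can be written out. The following assumes an idealized battery of 20 mutually uncorrelated tests, each measuring one of 20 equally important factors that together make up the whole criterion; r_k denotes the validity of any one test, and the additive form follows the principle of formula (16.12).

```latex
\[
r_k^2 = \frac{1}{20} = .05, \qquad r_k = \sqrt{.05} \approx .224,
\qquad R^2 = \sum_{k=1}^{20} r_k^2 = 20 \times .05 = 1.00
\]
```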


MULTIPLE CORRELATION WITH MORE THAN THREE VARIABLES 


With more than three variables, a good solution of a regression equation 
and of a multiple R is by means of the Doolittle method. This procedure 
will be outlined step by step for a five-variable problem. We shall use all the 
variables represented in Table 16.1, asking what regression weights would 
best predict X1 from the other four combined and what the correlation of 
those predictions with obtained X1 values would be. 

1 For a more detailed discussion of these problems, see Guilford, J. P. New standards 
for test evaluation. Educ. psychol. Measmt., 1946, 6, 427-438. 

Solution of Normal Equations. The mathematically inclined reader will 
appreciate better what is transpiring in applying the Doolittle method if he 
knows that he is actually solving simultaneous equations. The unknowns 
are the beta coefficients, and there are as many equations as unknowns. For 
a five-variable problem, in which there are four unknown betas, the equations 
are 

\[
\begin{aligned}
\beta_{12} + r_{23}\beta_{13} + r_{24}\beta_{14} + r_{25}\beta_{15} &= r_{12} \\
r_{23}\beta_{12} + \beta_{13} + r_{34}\beta_{14} + r_{35}\beta_{15} &= r_{13} \\
r_{24}\beta_{12} + r_{34}\beta_{13} + \beta_{14} + r_{45}\beta_{15} &= r_{14} \\
r_{25}\beta_{12} + r_{35}\beta_{13} + r_{45}\beta_{14} + \beta_{15} &= r_{15}
\end{aligned}
\qquad \text{(Normal equations for a five-variable problem)} \quad (16.13)
\]

The beta coefficients are symbolized in abbreviated form here to conserve 
space. β12, in full, would be β12.345, and β13 would be β13.245, and so on. The 
equations are systematic, the r coefficients being arranged as in the original 
table of intercorrelations (see Table 16.1). The betas in the diagonal 
positions might be expected to have coefficients r22, r33, r44, and r55 attached to 
them, but instead the coefficients attached to these betas are all +1.0, as the 
least-square solution requires. 


TABLE 16.5. SOLUTION OF A MULTIPLE-CORRELATION PROBLEM BY THE 
DOOLITTLE METHOD 


                                    Column number 
Row   Instruction            2         3         4         5         1      Check sum 

A     r2k                 1.0000     .5620     .4010     .1970     .4650     2.6250 
B     A ÷ (−A2)          −1.0000    −.5620    −.4010    −.1970    −.4650    −2.6250 
C     r3k 
D     A × B3 
E     C + D 
F     E ÷ (−E3) 
G     r4k 
H     A × B4 
I     E × F4 
J     G + H + I 
K     J ÷ (−J4) 
L     r5k 
M     A × B5 
N     E × F5 
O     J × K5 
P     L + M + N + O 
Q     P ÷ (−P5) 

The Doolittle-solution Operations. First we prepare a work sheet like 
that in Table 16.5. There is a column for every variable and the number- 
ing corresponds. A last column is introduced for the purpose of checking the 
calculations, as will be explained. The rows are designated by letters, and 
in the first column a shorthand instruction is noted. These will be explained. 


Step 1. Record in row A the correlations with X2. These are obtained here 
from Table 16.1. In column 2, a coefficient of 1.0000 is inserted, 
because it is demanded by the Doolittle method. We are going to 
carry four decimal places throughout the solution (one more than 
those given in the r's), and so we record all numbers to four places. 

Step 2. Sum the values recorded in row A, and give the sum in the last, or 
“check,” column. This will be used later. 

Step 3. Divide the numbers in row A each by −1.0000. In the table, the 
instruction reads “A ÷ (−A2),” which means that each number 
in row A is to be divided by the number that appears at A2 (row A, 
column 2) with sign changed. This includes the last column as well. 

Step 4. Record in row C all the remaining correlations with X3. We say 
“remaining,” because one is already recorded, namely, r23. The 
value of 1.0000 is recorded at C3. 

Step 5. Sum all the correlations with X3, including the .5620 in row A. 
Record the sum in the “check” column. 

Step 6. The numbers in row D are found by the instruction “A × B3,” 
which means to multiply all the numbers in row A (beginning in 
column 3) by the number that appears in row B and column 3. This 
number is −.5620 in Table 16.5. 

Step 7. Row E calls for the addition of all numbers in rows C and D. 

Step 8. Row F calls for the division of all numbers in row E by the number 
appearing in row E and column 3, with sign changed. This number, 
with sign changed, is −.6842. 

Step 9. We are ready for the first checking of calculations. Sum the values 
in row F, not including the last column. This should equal 
approximately −1.8720 in this particular problem, which was found by the 
steps already described. If there is a serious discrepancy here 
(other than in the fourth decimal place), check row E by adding 
values up to the check column. If this does not check, there is an 
error further back, and some recalculating is in order. All checks 
should be satisfied before proceeding. 

Step 10. In row G, record remaining correlations with X4, with 1.0000 at G4. 

Step 11. Sum all the correlations with X4, and record in the last column in 
row G. 

Step 12. Values in row H are the products of values in row A times the 
number at B4. This number is −.4010. 

Step 13. Values in row I are the products of numbers in row E times the 
number at F4, which is −.2493. 

Step 14. Sum the numbers in rows G, H, and I for each column. These sums 
form row J. 

Step 15. Divide row J through by the number at J4, with sign changed; in 
other words, by −.7967. The quotients form row K. 

Step 16. Check by summing row K up to the last column. Does the sum 
agree with the number already found in that column? 

Step 17 and after. By now the abbreviated instructions for each row should 
be clear by analogy to those already given. The final check is made 
in row Q. 


The illustrative solution is set up for a five-variable problem, but a larger 
number of variables would be treated in a similar manner simply by extending 
the table to more rows and columns. A smaller number of variables would 
mean fewer rows and columns. It will be noticed that the table is set up in 
terms of blocks of work, each one beginning with the entrance of correlations 
for a new variable and ending by dividing by a number that will assure a 
— 1.0000 as the first number in the last row of that block. The work will be 
found to be very systematic throughout. Any variable may be treated as 
the dependent variable, but it must then occupy the next to the last column 
in the table. 

Solution of the Beta Coefficients. The work represented in Table 16.5 is only 
a part of the Doolittle solution. The end result gives the beta coefficients, 
which we find by means of a “back solution,” so called because we work in a 
backward direction, as compared with the work in Table 16.5. This work 
can be tabulated, but it is probably clearest to the beginner in the form of 
equations. The first beta found is β15, which can be located without further 
ado in Table 16.5. It is the number at the intersection of row Q and column 
1, but with sign changed (in other words, it is described as −Q1). β15 is 
therefore +.1607. The other betas require more work, and so we shall follow 
the procedure step by step, including again the first step already taken, for 
the sake of completeness. 


Step 1. β15 = −Q1 = +.1607 
Step 2. β14 = −K1 + β15(K5) = .3506 + (.1607)(−.3012) = +.3022 
Step 3. β13 = −F1 + β15(F5) + β14(F4) 
            = .4702 + (.1607)(−.1524) + (.3022)(−.2493) = +.3703 
Step 4. β12 = −B1 + β15(B5) + β14(B4) + β13(B3) 
            = .4650 + (.1607)(−.1970) + (.3022)(−.4010) + (.3703)(−.5620) 
            = +.1039 


Before going further, it is well to check the calculations of the beta 
coefficients. This can be done by using the last equation in (16.13): 

β12 r25 + β13 r35 + β14 r45 + β15 = r15 

Substituting known values, 

(.1039)(.197) + (.3703)(.215) + (.3022)(.345) + .1607 = .3651 

Since r15 = .365, the check is satisfied, and we may assume that there has 
been no error in computing the betas. This checking procedure can be 
summarized as in Table 16.6, which provides a convenient work plan. 


TABLE 16.6. A CHECK UPON THE COMPUTATION OF THE BETA COEFFICIENTS 


Variable       β1k rk5 

X2              .0205 
X3              .0796 
X4              .1043 
X5              .1607 

Σ             = .3651 = r15 
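
The elimination and back solution just described can be sketched in a few lines of code. This is ordinary forward elimination without the sign changes and check column of the work sheet, not a transcription of Table 16.5, and the function name is hypothetical. The values r14 = .546 and r34 = .396 are not quoted directly in this section; they are inferred from its worked arithmetic and should be treated as assumptions.

```python
# Intercorrelations of the predictors X2..X5 (rows and columns in that order)
# and their validities with X1. r34 = .396 and r14 = .546 are assumed values.
R = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
validities = [0.465, 0.583, 0.546, 0.365]

def solve_normal_equations(matrix, rhs):
    """Forward elimination followed by a back solution (no pivoting)."""
    n = len(rhs)
    # Carry the right-hand side along as an extra column of each row.
    rows = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for i in range(n):
        pivot = rows[i][i]
        for j in range(i + 1, n):
            factor = rows[j][i] / pivot
            rows[j] = [a - factor * b for a, b in zip(rows[j], rows[i])]
    # Back solution: the last unknown first, then upward.
    solution = [0.0] * n
    for i in reversed(range(n)):
        known = sum(rows[i][j] * solution[j] for j in range(i + 1, n))
        solution[i] = (rows[i][n] - known) / rows[i][i]
    return solution

betas = solve_normal_equations(R, validities)
print([round(b, 4) for b in betas])

# Check with the last normal equation: the sum should reproduce r15.
check = sum(b * r for b, r in zip(betas, R[3]))
print(round(check, 4))
```

Under these assumptions the betas agree with those of the back solution to about three decimal places; small differences in the fourth place reflect rounding in the hand computation.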


The Solution of Regression Weights and the Multiple R. Each b coefficient 
needed in the multiple-regression equation is found from its corresponding 
beta. Equations like those in formulas (16. 20) and (16.25) apply. The b 
weight for X2 should now read in full b12.345 to indicate that we are interested 
in the relation of X1 to X2, other variables, X3, X4, and X5, being held 
constant. For the sake of brevity (as, indeed, we have already done for the 
betas), we shall denote the b's only by the first two subscript numbers: b12, b13, 
etc. In the solution of a multiple R, equation (16.5) needs to be extended to 
include as many terms as there are variables. R² is the sum of the products 
of beta times its corresponding r, i.e., 

\[
R^2_{1.23 \ldots m} = \beta_{12} r_{12} + \beta_{13} r_{13} + \cdots + \beta_{1m} r_{1m}
\qquad \text{(Multiple } R^2 \text{ from beta coefficients)} \quad (16.14)
\]

The a coefficient in the equation is also found by formula (16.4), extended 
with as many terms as necessary. It is the mean of the X1 values minus the 
products of other means times their corresponding b weights, as 

\[
a = M_1 - b_{12} M_2 - b_{13} M_3 - \cdots - b_{1m} M_m
\qquad \text{(Constant } a \text{ in a multiple-regression equation)} \quad (16.15)
\]

All these operations are conveniently carried out in a work sheet like Table 
16.7, where R and the regression weights are systematically calculated. The 
second column contains the four betas. The third contains the original, or 
raw, correlations of the four variables with X1. The subscript k stands for 
variables 2 to 5 in turn. The fourth column contains the cross products of 
betas times corresponding r's. Their sum is R², which here is .487855; and 
by taking the square root we find R is .698. This R, with full subscript, 
would read R1.2345. 


TABLE 16.7. SOLUTION OF THE REGRESSION COEFFICIENTS FOR THE 
MULTIPLE-REGRESSION EQUATION 


               (4)              (7)        (8) 
Variable     β1k r1k            Mk      (−Mk b1k) 

X2           .048314           19.7     − 3.585 
X3           .215885           49.5     − 9.801 
X4           .165001           61.1     − 8.676 
X5           .058655           29.7     −11.732 

Σ            .487855 = R²       Σ       −33.794 
             .698    = R        M1       73.800 
                                a =      40.006 

So much for the multiple R, which we see is not increased very much by 
including two more variables (X2 and X5) over that obtained when we used 
only X3 and X4. Then R equaled .677. The coefficient of determination is 
now .4879, or we have accounted for 48.8 per cent of the variance of freshman 
scholarship, as compared with 45.8 per cent without using X2 and X5. The 
standard error of estimate (now designated as σ1.2345 in full) equals 6.5, where 
before it was 6.7, a trifling change. The index of forecasting efficiency is now 
28.4 per cent, where before it was 26.4 per cent. It is therefore questionable 
whether the trouble of measuring and using in the regression equation the two 
additional variables is worthwhile. This is a good example of the way in 
which each additional variable yields diminishing returns in the way of 
improved predictions. 
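
Formula (16.11) offers a formal test of such a gain. The sketch below applies it to the two R's just compared; the function name is hypothetical, and N = 174 is an assumption carried over from the three-variable illustration, since the sample size is not restated for the five-variable problem.

```python
def f_for_gain_in_r(r_larger, r_smaller, n, m_larger, m_smaller):
    """F ratio of formula (16.11), returned with its two degrees of freedom."""
    df1 = m_larger - m_smaller
    df2 = n - m_larger - 1
    f = ((r_larger**2 - r_smaller**2) * df2) / ((1 - r_larger**2) * df1)
    return f, df1, df2

# R = .698 with four independent variables versus R = .677 with two.
f, df1, df2 = f_for_gain_in_r(0.698, 0.677, 174, 4, 2)
print(round(f, 2), df1, df2)
```

The resulting F would be referred to the F tables with df1 and df2 degrees of freedom.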

For the solution of the b coefficients, we introduce in Table 16.7 first the 
column headed σ1/σk. This is the ratio by which each beta is to be multiplied. 
The b coefficients follow in column 6. They tell how many units X1 is 
increasing for each unit of increase in the other variables. From these taken 
alone, it would seem that X5 (interests) has the greatest bearing upon 
freshman marks and that X4 (high-school average) has the least. But such is not 
the situation. The best comparison of each variable's contribution to the 
variance in X1 is to be seen in column 4, where each beta is multiplied by the 
corresponding raw r. Here it is seen that X3 contributes about 21 per cent, 
X4 nearly 17 per cent, whereas X5 contributes only about 6 per cent, and X2 
about 5 per cent. These statements are relative to this correlational 
situation, with the influences of overlapping among the four taken into account. 
But as to choices among the four variables that we have here, they come in 
the same rank order as the βr products. 




For the solution of the a coefficient, the last two columns are included. 
This coefficient turns out to be 40.0 to one decimal place. The entire 
regression equation now reads 

\[
X'_1 = 40.0 + .182X_2 + .198X_3 + .142X_4 + .395X_5
\]

With this equation, we could predict an X1 for every student, knowing his 
four scores in the other variables. As was said before, the addition of the 
terms involving X2 and X5 yields scarcely enough additional accuracy of 
prediction to justify their inclusion. One could try combinations of three 
predictive indices, variables X2, X3, and X4, or X3, X4, and X5, to see what 
happens. From the results in Table 16.7, it would seem that the last-mentioned 
combination of three is the more promising. One could determine by another 
Doolittle solution whether it increased R sufficiently above .677 to justify the 
inclusion of X5 with X3 and X4. 
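
Applying the regression equation is a matter of simple substitution. The sketch below uses the constant and weights given above; the function name is hypothetical, and the only case evaluated is a student standing exactly at the mean of every predictor, for whom the prediction should fall at the mean of X1.

```python
def predict_x1(x2, x3, x4, x5):
    """Predicted X1 from the four-predictor regression equation."""
    return 40.0 + 0.182 * x2 + 0.198 * x3 + 0.142 * x4 + 0.395 * x5

# Means of X2..X5 taken from Table 16.7
predicted = predict_x1(19.7, 49.5, 61.1, 29.7)
print(round(predicted, 1))
```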


SHORT SOLUTIONS FOR REGRESSION WEIGHTS 


Solution of a multiple-regression problem, even with the convenient Doo- 
little procedure, becomes energy- and time-consuming when the number of 
variables is large. The author has known of test batteries involving as many 
as 20 possible scores that could be combined each with its appropriate weight. 
When there are more than six variables the situation calls for possible short 
cuts or approximation methods. Two methods will be mentioned to meet 
this need, one of which will be illustrated. 

The Wherry-Doolittle Method. In recent years a modified Doolittle 
solution has been introduced by Wherry.1 The method was designed to meet the 
requirement of assembling a battery of tests to select personnel for some 
particular assignment. It takes particular cognizance of the fact that when a 
large number of tests are validated singly for the prediction of a certain 
criterion, only four or five when combined often seem sufficient. As a matter 
of fact, adding tests beyond the point at which all the factors that the tests 
measure in common with the criterion are covered often merely contributes 
error variance to the composite. Even before the point has been reached 
where there is no apparent improvement in prediction, errors have entered 
into the picture to help determine the regression weights. This point was 
mentioned earlier in connection with the discussion of shrinkage formulas 
[see formulas (16.7) and (16.8)]. 

The principles of the Wherry-Doolittle method are, briefly, as follows: One 
starts with the single test that seems to offer most in prediction of the 
criterion. The method then aids in selection of the second test that will have 
most to add to prediction when combined with the first. A third can be 
selected which will add most by way of prediction when combined with the 
first two, and so on. At each step a shrinkage formula is applied in order to 
determine whether the shrunken R is appreciably larger than the previous R. 
At the point where no further gain according to these standards is apparent, 
no more tests are added. 

1 Described in full in Stead, W. H., Shartle, C. L., et al. Occupational Counseling 
Techniques. New York: American Book, 1940. Pp. 245-255. 

The method does undoubtedly offer an efficient way of assembling a battery 
of tests to meet a particular purpose. It results in a list of predictive instru- 
ments that, out of a larger number tried experimentally, is minimum for 
doing the job. 

The author is inclined toward a quite different philosophy of development 
of test batteries, however, which would render the Wherry-Doolittle pro- 
cedure unnecessary when there is sufficient information about the criterion 
and the tests.1 For this reason the space that it would take to explain and 
demonstrate the Wherry-Doolittle method is not used here. The reason why 
only four or five tests have seemed to be the limit in a useful battery is that 
only a limited number of the human abilities and other traits that are involved 
in a practical criterion have been represented in the tests. Although a dozen 
different tests may have been tried out, the same limited number of funda- 
mental factors have been measured by them and the measurement is dupli- 
cated several times over. If a careful study of the criterion is made, revealing 
all the factors that are worth trying to predict, and if there is sufficient variety 
in the tests to take care of all the factors, it will be found that more than four 
or five tests will probably be needed. If one knows that there are 10 traits 
in the criterion that are worth covering with tests, and if it takes 10 tests to do 
it, then one could put the 10 tests in a battery and expect that every one 
would have something to contribute toward prediction. A successive selec- 
tion of tests by a method such as the Wherry-Doolittle would then be 
unnecessary. 

An Iterative Solution of Regression Weights. The iterative procedure for 
computing beta weights to be described and illustrated is economical, 
particularly for a problem with many variables, and will probably lead to 
satisfactory results in most cases.2 The operations will be described step by step 
and are illustrated in Table 16.8 with the use of the same data to which the 
Doolittle method was applied earlier. 

1 For a discussion of this at some length, see Guilford, J. P. Factor analysis in a test 
development program. Psychol. Rev., 1948, 55, 79-94. 

2 The procedure is the author's version of R. L. Thorndike's adaptation of one originally 
developed by Kelley and Salisbury. See Thorndike, R. L. Research Problems and 
Techniques. AAF Aviation Psychology Research Program Reports, No. 3. Washington, D.C.: 
GPO, 1947; also Kelley, T. L., and Salisbury, F. S. An iteration method for determining 
multiple correlation constants. J. Amer. statist. Ass., 1926, 21, 282. 


TABLE 16.8. AN ITERATIVE SOLUTION OF THE BETA COEFFICIENTS 


The general principle of the method is (1) to guess what the betas are going 
to be, (2) to substitute them in the normal equations [see equations (16.13)], 
(3) to see how much discrepancy there is between the known validity coefficients 
and those that follow from the guessed betas, and (4) to make corrections in the 
guessed betas. These steps are repeated until the discrepancies practically 
vanish. 


The correlations that enter into the normal equations are listed first 


in the worktable, upper left-hand corner. From here on the steps will be 


listed. 


Step 1. Compute the sum of each column of correlations, $\Sigma r_{ka}$, where a stands
for each of the independent variables representing columns and k
stands for each of the variables in rows in turn. $\Sigma r_{ka}$ in the first
column of correlations is $\Sigma r_{k2}$, and so on.

Step 2. Make a guess for the size of each beta (these will be $\beta'_{12}$, $\beta'_{13}$, and so on)
by dividing the validity coefficient for each test by the sum of its
column of r's. These may be made to two decimal places to start
with, but one place will do about as well. For example, $\beta'_{12}$ is estimated
by the ratio .465/2.160, which equals .215, but this has been
rounded to .2. $\beta'_{13}$ is estimated by the ratio .583/2.173, which is
.268, rounded to .3. With more variables, a multiple of each such
ratio would be a better estimate.

Step 3. Solve each equation, substituting the guessed betas for the unknown
betas. The first equation would read

(1.000)(.2) + (.562)(.3) + (.401)(.3) + (.197)(.2) = .5283

This gives a value symbolized by $r'_{1j}$ (for the first equation it is $r'_{12}$)
and recorded in the column just after the validity coefficients, $r_{1j}$.
Four decimal places will be carried from here on in order to obtain
three significant digits in the betas.

Step 4. Find the discrepancy between each validity estimated from the use
of the guessed betas and the obtained validity. Call these values $d_j$.
For the first test, $d_2 = r'_{12} - r_{12} = +.0633$. This means that, with
the betas which were assumed, the validity of variable $X_2$ would have
to be .0633 higher than the validity of .465 which had been obtained.
The $d_3$ of -.0088 for variable $X_3$ indicates that the guessed betas
underestimate the validity of that test.

Step 5. Make the first change in the guessed betas. Although we can see
that the betas for $X_2$, $X_4$, and $X_5$ have been perhaps overestimated
and that for $X_3$ underestimated, it is most convenient, and perhaps
just as expedient, to make only one change at a time. Note where
the largest discrepancy is. It is the +.0633 for variable $X_2$. If we
make a change only in $\beta'_{12}$ it will affect only the first term in each
equation and will involve only the first column of correlations. To
lower $d_2$ to zero for the first test in the list, we would need to multiply
1.000 by some amount that will cancel it. A change of -.0633 would
do this, but it is best to limit adjustments to the second decimal place
at this stage. We shall therefore change $\beta'_{12}$ by -.06, making it .14.

Step 6. Modify the discrepancies in line with the change in $\beta'_{12}$ just mentioned.
Every $d_j$ will be altered by adding to it the product of
the change times the corresponding value $r_{j2}$. The first, $d_2$, will
be +.0633 + (-.06)(1.000) = +.0033. The second, $d_3$, will be
-.0088 + (-.06)(.562) = -.0425, and so on.


The general pattern of the procedure is now complete. We keep on making 
successive adjustments as called for, computing the altered discrepancies, 
with an attempt to reduce them almost to zero. Since we are expecting 
three-place accuracy in the betas, we shall find that it pays to continue until 
the discrepancies are not over .0005. After we have achieved good adjust- 
ment up to the second decimal place in the betas, we then proceed to make 
adjustments in the third decimal place. A comparison of the betas found in 
Table 16.8 with those found by the Doolittle solution (see Table 16.7) will 
show very good agreement to the third decimal place. 
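For readers who wish to let a machine do the bookkeeping, the six steps can be sketched in Python. This is only an illustration under stated assumptions, not the worktable of Table 16.8. The correlations r34 = .396 and r35 = .215 are not quoted in this section; they are inferred from the column sum 2.173 and the discrepancy -.0088 given in Steps 2 and 4, and should be checked against Table 16.1. The names `discrepancies` and `iterate_betas` are hypothetical. The sketch also departs from the hand routine in one respect: it applies the full correction to the beta with the largest discrepancy instead of limiting each change to the second and then the third decimal place.

```python
# Correlations among the predictors X2..X5 (rows and columns in that order).
# r34 = .396 and r35 = .215 are inferred values (see the note above).
R = [
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]
validities = [0.465, 0.583, 0.546, 0.365]  # r12, r13, r14, r15


def discrepancies(R, validities, betas):
    """Steps 3 and 4: validity implied by the guessed betas minus obtained validity."""
    n = len(betas)
    return [
        sum(R[j][k] * betas[k] for k in range(n)) - validities[j]
        for j in range(n)
    ]


def iterate_betas(R, validities, tol=0.0005, max_steps=10_000):
    """Adjust one beta at a time until no discrepancy exceeds tol."""
    n = len(validities)
    # Steps 1 and 2: validity divided by the column sum, rounded to one place.
    betas = [
        round(validities[j] / sum(R[k][j] for k in range(n)), 1)
        for j in range(n)
    ]
    d = discrepancies(R, validities, betas)
    for _ in range(max_steps):
        # Step 5: locate the largest discrepancy and change only that beta.
        j = max(range(n), key=lambda i: abs(d[i]))
        if abs(d[j]) <= tol:
            break
        change = -d[j]  # the diagonal entry is 1.000, so this cancels d[j]
        betas[j] += change
        # Step 6: every discrepancy moves by the change times column j.
        d = [d[i] + change * R[i][j] for i in range(n)]
    return betas, d


betas, d = iterate_betas(R, validities)
print([round(b, 3) for b in betas])
```

The printed betas may be compared with those of the Doolittle solution in Table 16.7. As the text warns, small discrepancies alone do not prove the arithmetic; an independent check of the betas is still advisable.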

From the beta coefficients found in this manner one may proceed to com- 
pute the multiple correlation, the b weights, and other derived statistics. 

Great care should be taken for accuracy of computation. Errors may creep 
in at any stage and it still might be possible to reach what looks like a satisfactory
solution, that is, with zero discrepancies, with wrong betas. It would
certainly be well to check the accuracy of the obtained betas as was done 
following the Doolittle solution. There may be some problems, with peculiar 
combinations of correlations, in which the iteration would not achieve zero 
discrepancies even after a long series of trials. The author has not encoun- 
tered such a situation as yet. The routine described above may be modified 
as the user of it gains experience. There are opportunities for making wiser 
choices of betas and changes in betas that might cut the number of steps. 

Thorndike makes some suggestions concerning the original source of 
guessed betas.1 If we have prior knowledge of how a given test has performed
in a similar battery for making a similar prediction, it would be well to
start with that knowledge. If the battery is a very large one (10 or more), it 
would be desirable to start with about half of the guessed betas equal to zero. 
Kelley and Salisbury had suggested that each beta be guessed as about half 
the corresponding validity coefficient, but Thorndike suggests between one-fourth
and one-half is better. If a test correlates relatively low with others,
the chances are that its beta will go higher than original estimates, and, conversely,
if it correlates relatively high with other tests, its beta will prove to
be lower.

1 Thorndike, op. cit.

COMBINATIONS OF MEASURES

The regression equation is a means of combining different measures of the
same object in order to derive a composite measure or score. The scores are
summed, each weighted by its regression coefficient. There are other ways
of combining scores to form a composite. For example, one might simply
sum the raw scores for each person without applying differential weights. 
This is the common practice in deriving total scores of tests composed of 
subtests of different kinds, though in some cases there is some effort at weight- 
ing, for example, multiplying one score by 2, another by 3, and so on. 

Actually, every test that is composed of items may be regarded as a battery 
of as many tests as there are items. The total score is usually an unweighted 
summation of the item scores, though in many interest and temperament tests 
there may be differential weighting. Rarely does a test maker resort to the 
determination of regression weights for test items, but the same principle that 
applies to test batteries could be adapted to single tests composed of parts. 
More often than not, even in the case of test batteries, there are so many 
parts, or they are used to predict in such a variety of situations, that there is 
not sufficient incentive to work out the regression weights. 

Because there must be substitute weighting procedures in combining tests, 
it is important to know some of the better substitute procedures for the 
multiple-regression equation and to be able to evaluate the effectiveness of a 
composite derived by any method. The multiple R applies only when the 
optimal regression weights are used; other weights will yield a composite that 
is likely to correlate less with the criterion. There are other problems con- 
nected with composite scores that call for attention, including what mean and 
what standard deviation will result when measures are combined each with a 
certain weight. These problems will be dealt with in following paragraphs. 

Means of Weighted Composites. When several measures of the same
object are summed, each with its own weight, the mean of the same kind of
composite for a sample of objects is given by the equation1

$$M_{ws} = \Sigma w_i M_i$$ (Mean of a sum of weighted measures) (16.16)

where $w_i$ = weight applied to each variable $X_i$, when i varies from 1 to n in a
total of n variables, and $M_i$ = mean for the same sample of objects in variable
$X_i$.
If we apply this to the b weights computed for the regression equation in
the prediction of freshman average grades (see p. 410), the solution would be

$$M_{ws} = (.182)(19.7) + (.198)(49.5) + (.142)(61.1) + (.395)(29.7)$$

Thus, the mean of the composite of four variables, including $X_2$ (arithmetic
test), $X_3$ (analogies test), $X_4$ (high-school average), and $X_5$ (interest score),
weighted with the coefficients .182, .198, .142, and .395, respectively, would
be 33.8. This value is 40.0 units short of the mean for the criterion (freshman
grades). By adding the difference (40.0), which is the a coefficient of the
complete regression equation, we obtain a composite mean that coincides
with that of the criterion. This discussion, in other words, explains the need
for the a coefficient in the complete regression equation. If we were not
interested in achieving that mean, we could drop the constant 40.0 and be
left with a mean of 33.8.

1 For proof, see Appendix A.
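The arithmetic of formula (16.16) can be checked with a few lines of Python. This is a minimal sketch; the function name `composite_mean` is hypothetical, and the weights, means, and criterion mean of 73.8 are the values quoted in this chapter.

```python
def composite_mean(weights, means):
    """Formula (16.16): the mean of a weighted sum is the weighted sum of the means."""
    return sum(w * m for w, m in zip(weights, means))


b_weights = [0.182, 0.198, 0.142, 0.395]  # b weights for X2, X3, X4, X5
means = [19.7, 49.5, 61.1, 29.7]          # means of X2, X3, X4, X5
criterion_mean = 73.8                     # mean of freshman grades

m_ws = composite_mean(b_weights, means)
shortfall = criterion_mean - m_ws         # the role played by the a coefficient

print(round(m_ws, 1), round(shortfall, 1))
```

The second printed value is the amount by which the weighted composite falls short of the criterion mean, which is what the a coefficient supplies.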

Standard Deviations of Weighted Composites. We can likewise estimate 
the standard deviation of a composite measure when each component has a 
multiplier or weight. The computation of this statistic may be clearer, how- 
ever, if we consider the standard deviation of a simple unweighted sum first. 

The Standard Deviation of Sums When Weights Equal One. When scores
from different tests are summed without applying differential weights, we
may regard the weight for each test to be +1. When two scores are summed
to make the composite, the variance of the composite scores is given by the
equation1

$$\sigma_s^2 = \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2$$ (Variance of a sum of two unweighted measures) (16.17)

where $\sigma_1^2$ and $\sigma_2^2$ = variances of the components and $r_{12}$ = coefficient of
correlation between the two components.

The expression $r_{12}\sigma_1\sigma_2$ is known as the covariance of the two components.
Its relation to correlation can be better shown by relating it to the Pearson
formula, in which

$$r_{12} = \frac{\Sigma x_1 x_2}{N\sigma_1\sigma_2}$$

If we multiply both sides of this equation by $\sigma_1\sigma_2$ we have

$$r_{12}\sigma_1\sigma_2 = \frac{\Sigma x_1 x_2}{N}$$

The parallel between the term at the right and the expression for a variance
should be obvious. A variance is of the form $\Sigma x_1 x_1/N$ or $\Sigma x_1^2/N$. A
covariance is the mean of the cross products of deviations; a variance is a
mean of the squares of deviations. With this new information as background,
we may translate equation (16.17) into English by saying that the
variance of a composite is equal to the sum of variances of the components
plus twice the covariances of all pairs of those components. This is a general
principle that is important to remember.
From equation (16.17) it follows, by taking square roots, that

$$\sigma_s = \sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2}$$ (Standard deviation of a sum of two unweighted measures) (16.18)

A demonstration of how this works out in a particular sample is given in
Table 16.9. Ten scores are given for the same individuals in $X_a$ and in $X_b$,
between which the correlation $r_{ab}$ equals zero. If r = 0, the third term in
formula (16.17) drops out and the variance of the composite is merely the sum
of the variances of the components.

1 For proof, see Appendix A.


TABLE 16.9. THE VARIANCE AND VARIABILITY OF A COMPOSITE SCORE THAT IS
THE UNWEIGHTED SUM OF TWO UNCORRELATED SCORES

Individual    Xa    xa   xa²     Xb    xb   xb²     Xs    xs   xs²
A              1    -4    16      6     0     0      7    -4    16
B              3    -2     4      7    +1     1     10    -1     1
C              4    -1     1      4    -2     4      8    -3     9
D              5     0     0     10    +4    16     15    +4    16
E              5     0     0      8    +2     4     13    +2     4
F              5     0     0      0    -6    36      5    -6    36
G              5     0     0      6     0     0     11     0     0
H              6    +1     1      8    +2     4     14    +3     9
I              7    +2     4      5    -1     1     12    +1     1
J              9    +4    16      6     0     0     15    +4    16
Σ             50     0    42     60     0    66    110     0   108
M            5.0         4.2    6.0         6.6   11.0        10.8
σ                       2.05               2.57               3.29


In the illustration in Table 16.9, the variances of the two components are
4.2 and 6.6, respectively. Their sum is 10.8, which checks with the mean of
the squares found from variable $X_s$. The way in which variances combine is
also demonstrated in Fig. 16.3, which pictures hypothetical distributions for
$X_a$, $X_b$, and their sum $X_s$. The position of the scale for $X_s$ is determined by
the juncture of the lines erected at distances of $1\sigma$ from the means of $X_a$ and
$X_b$. The slanted scale of $X_s$ is closer to that of $X_b$, consistent with the fact
that $X_b$ contributes more variance to it than does $X_a$ and the fact that the
composite correlates higher with $X_b$ than with $X_a$. But these are incidental
considerations here. The important demonstration is that when two variables
like $X_a$ and $X_b$ are uncorrelated, we may regard the standard deviation
of their composite $X_s$ as the hypotenuse of a right triangle of which $\sigma_a$ and $\sigma_b$
are the legs. The old, familiar Pythagorean theorem thus applies to the
summation of two independent variables.
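The figures in Table 16.9 can be verified directly from the ten pairs of scores. The short Python sketch below is offered only as a check on the table; the helper names `mean` and `covariance` are hypothetical, and variances are computed with N (not N - 1) in the denominator, as in the table.

```python
import math

x_a = [1, 3, 4, 5, 5, 5, 5, 6, 7, 9]      # scores in Xa, individuals A to J
x_b = [6, 7, 4, 10, 8, 0, 6, 8, 5, 6]     # scores in Xb
x_s = [a + b for a, b in zip(x_a, x_b)]   # unweighted composite Xs


def mean(values):
    return sum(values) / len(values)


def covariance(u, v):
    """Mean cross product of deviations; covariance(u, u) is the variance of u."""
    mu, mv = mean(u), mean(v)
    return sum((p - mu) * (q - mv) for p, q in zip(u, v)) / len(u)


var_a = covariance(x_a, x_a)
var_b = covariance(x_b, x_b)
cov_ab = covariance(x_a, x_b)
var_s = covariance(x_s, x_s)

# Equation (16.17): variance of the sum = sum of variances + twice the covariance.
identity_holds = math.isclose(var_s, var_a + var_b + 2 * cov_ab)

# With zero covariance, sigma_s is the hypotenuse on legs sigma_a and sigma_b.
sigma_s = math.hypot(math.sqrt(var_a), math.sqrt(var_b))

print(var_a, var_b, cov_ab, var_s, identity_holds, round(sigma_s, 2))
```

Because the cross products of deviations sum to zero in these data, the covariance term vanishes and the Pythagorean relation holds exactly.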

Relation of $\sigma_s$ to the Standard Error of a Difference. The similarity between
equation (16.18) and equation (9.19) for the standard error of a difference will
probably have been noticed. The only difference is in the algebraic sign of
the covariance term, $2r_{12}\sigma_1\sigma_2$, which is positive in the case of $\sigma_s$ and negative
in the case of $\sigma_d$. Of course, in the preceding discussion of $\sigma_s$ we have been
applying it to distributions of single observations, whereas $\sigma_d$ has been applied
to distributions of means (mean differences). The principles are the same,
either with means or with single observations. Had we written the summation
equation in the form $X_d = X_a - X_b$, instead of $X_s = X_a + X_b$, we
should have been dealing with differences instead of sums. On the other
hand, in the equation $X_d = X_a - X_b$, we can say that we actually have a
summation of scores, those for $X_a$ having a weight of +1 and those for $X_b$ a
weight of -1.

FIG. 16.3. Illustration of the way in which the standard deviation of an unweighted sum
of two scores is related to the standard deviation of those two scores taken separately
when the two are uncorrelated. (Axis label: Score in test A, $X_a$.)


Variance of a Composite of More Than Two Components. Equation (16.17)
can be extended to include any number of unweighted components. For each
component there would be its variance but there would be as many covariance
terms to include as there are pairs of components. With three components
there would be three covariance terms: $2r_{12}\sigma_1\sigma_2$, $2r_{13}\sigma_1\sigma_3$, and $2r_{23}\sigma_2\sigma_3$. Where
there are n components, there are n(n - 1)/2 pairs and n(n - 1)/2 covariances
to consider. In terms of a general formula,

$$\sigma_s^2 = \Sigma\sigma_i^2 + 2\Sigma r_{ij}\sigma_i\sigma_j$$ (Variance of a sum of any number of unweighted components) (16.19)

where $\sigma_i^2$ = variance of any one component, $X_i$
$r_{ij}$ = correlation between any component $X_i$ and any other component $X_j$
with a higher subscript number
$\sigma_i$ and $\sigma_j$ = standard deviations of the two components correlated

Variance of a Composite of Weighted Components. When the components
are weighted differently, the variance of the composite will reflect the weights.
Let us begin with the special case of two components. If the summation
equation is of the form

$$X_{ws} = w_1X_1 + w_2X_2$$

the variance of $X_{ws}$ is given by the equation1

$$\sigma_{ws}^2 = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2$$ (Variance of a sum of two weighted components) (16.20)

where $w_1$ and $w_2$ = weights applied to components $X_1$ and $X_2$, respectively.
As an example of this type of problem, let us use the data on $X_4$ and $X_5$ in
Table 16.1. If these two variables are used in a composite to predict $X_1$, the
least-square solution gives b weights of .224 and .491, respectively, and a
multiple R, based upon these weights, of .578. The predicted $X_1$ values based
upon the equation $X'_1 = .224X_4 + .491X_5$ would be expected to have a
standard deviation equal to $R_{1.45}$ times $\sigma_1$. This product is .578 × 9.1, which
equals 5.26. Let us see whether formula (16.20) will lead to the same result.
By substituting the appropriate values,

$$\sigma_{ws}^2 = (.224^2)(19.4^2) + (.491^2)(3.7^2) + 2(.345)(.224)(19.4)(.491)(3.7) = 27.6319$$

from which $\sigma_{ws}$ = 5.26

This agrees exactly with the expectation.

With weights of +1 for both $X_4$ and $X_5$, application of formula (16.17)
would have given

$$\sigma_s^2 = 19.4^2 + 3.7^2 + 2(.345)(19.4)(3.7) = 439.5782$$

from which $\sigma_s$ = 21.0
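Both substitutions can be repeated mechanically. The following Python sketch assumes nothing beyond formula (16.20) and the values quoted above (standard deviations 19.4 and 3.7, intercorrelation .345); the function name `weighted_pair_variance` is hypothetical.

```python
import math


def weighted_pair_variance(w1, w2, s1, s2, r12):
    """Formula (16.20): variance of w1*X1 + w2*X2."""
    return w1**2 * s1**2 + w2**2 * s2**2 + 2 * r12 * w1 * w2 * s1 * s2


# b weights .224 and .491 applied to high-school average and interest score
var_b = weighted_pair_variance(0.224, 0.491, 19.4, 3.7, 0.345)

# weights of +1, which reduces (16.20) to formula (16.17)
var_unit = weighted_pair_variance(1, 1, 19.4, 3.7, 0.345)

print(round(var_b, 4), round(math.sqrt(var_b), 2))
print(round(var_unit, 4), round(math.sqrt(var_unit), 1))
```

Setting both weights to +1 shows that the unweighted formula is simply a special case of the weighted one.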


Variance of a Composite of Any Number of Weighted Components. When
there are more than two components, each weighted differently, the variance
of the composite is given by the general formula2

$$\sigma_{ws}^2 = \Sigma w_i^2\sigma_i^2 + 2\Sigma r_{ij}w_iw_j\sigma_i\sigma_j$$ (Variance of a sum of any number of weighted components) (16.21)

where $w_i$ = weight assigned to variable $X_i$, where i takes on values 1 to n - 1
in turn
$r_{ij}$ = correlation between $X_i$ and any other variable $X_j$, where j is a
subscript greater than i
$\sigma_i$ and $\sigma_j$ = standard deviations of $X_i$ and $X_j$, respectively

1 For proof, see Appendix A.
2 See Appendix A.




We could apply formula (16.21) to the four components of the regression
equation predicting freshman grades with the appropriate b weight substituted
for w in each case. We should find that the standard deviation is
equal to R times $\sigma_1$, which is .698 × 9.1 = 6.35. The inclusion of variables
$X_2$ and $X_3$ in the regression equation raises the dispersion of the predicted
grades from 5.26, which it would be with $X_4$ and $X_5$ only, to 6.35.
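The application of formula (16.21) to all four components is tedious by hand but brief in Python. The sketch below is illustrative only. The function name `composite_variance` is hypothetical, and the correlations r34 = .396 and r35 = .215 are assumptions: they are not quoted in this section but are inferred from the column sum and discrepancy given in the iterative solution, and should be checked against Table 16.1. Because the b weights are rounded to three places, the result agrees with R times the criterion standard deviation only approximately.

```python
import math


def composite_variance(weights, sds, corr):
    """Formula (16.21): variance of a weighted sum of any number of components."""
    n = len(weights)
    total = sum((weights[i] * sds[i]) ** 2 for i in range(n))
    for i in range(n - 1):
        for j in range(i + 1, n):  # every pair once, j a higher subscript than i
            total += 2 * corr[i][j] * weights[i] * weights[j] * sds[i] * sds[j]
    return total


b_weights = [0.182, 0.198, 0.142, 0.395]   # b weights for X2, X3, X4, X5
sds = [5.2, 17.0, 19.4, 3.7]               # standard deviations of X2..X5
corr = [                                   # r34 and r35 are inferred values
    [1.000, 0.562, 0.401, 0.197],
    [0.562, 1.000, 0.396, 0.215],
    [0.401, 0.396, 1.000, 0.345],
    [0.197, 0.215, 0.345, 1.000],
]

sd_four = math.sqrt(composite_variance(b_weights, sds, corr))
sd_two = math.sqrt(composite_variance([0.224, 0.491], [19.4, 3.7],
                                      [[1.0, 0.345], [0.345, 1.0]]))
print(round(sd_four, 2), round(sd_two, 2))
```

The first value should fall close to the 6.35 obtained from R times the criterion standard deviation, and the second reproduces the two-component result.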

Achieving Any Desired Standard Deviation in a Composite. In using
regression equations, the dispersion of the predictions falls short of that of the
obtained values. This is all right and proper when we are interested in predicting
an individual’s most probable measure on the scale of obtained measures
in $X_1$. The regression of predictions toward the general mean is a
natural phenomenon of imperfect correlation, as was pointed out before
(Chap. 15). There may be other uses of composites, however, that call for
other values than those given by the regression equation. Suppose that we
wanted predictions to spread just as much as the obtained values do. Suppose
that we should want them to be dispersed with some standard variability,
for example, with a $\sigma$ of 10.0, as on a T scale, or a $\sigma$ of 2.0, as on a C
scale (see Chap. 19). The way that kind of goal can be achieved will now
be explained.

Fortunately, for the solution of this problem, it is not the absolute sizes of 
the weights that matter; it is their ratios to one another. So long as they 
bear the same relations to each other, the correlation of the composite with
some criterion will remain the same. Consequently, we could double, triple, 
or otherwise change the regression weights by some common multiple, without 
affecting the predictive value, if all we want is to predict individuals in the 
same relative positions in a distribution. 

The $\sigma$ of the predictions is always related to the $\sigma$ of the obtained values by
the extent of the correlation (when optimal weights are used). In a multiple-regression
problem, $\sigma$ of the predicted values equals R times the $\sigma$ of the
obtained values. We can therefore make the $\sigma$ of the predictions equal the $\sigma$
of the obtained values by dividing each regression coefficient by R. A readjusted
b coefficient, then, would be computed by the formula

$$b'_{12.34 \ldots m} = \beta_{12.34 \ldots m}\left(\frac{\sigma_1}{R_{1.234 \ldots m}\,\sigma_2}\right)$$ (Regression coefficient adjusted to make the $\sigma$ of a composite equal $\sigma_1$) (16.22)

If the $\sigma$ desired in the composite is 10, or 2, or any other chosen quantity,
this could be achieved by substituting that quantity for $\sigma_1$ in formula (16.22).
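The rescaling can be tried on the two-variable composite of high-school average and interest score. The Python sketch below is illustrative; the names `adjusted_weights` and `pair_sd` are hypothetical. It starts from the b weights (which already contain the ratio of standard deviations) and multiplies each by the desired standard deviation over R times the criterion standard deviation, which reduces to division by R when the desired value is that of the criterion.

```python
import math


def adjusted_weights(b_weights, multiple_r, sd_criterion, sd_desired):
    """Rescale b weights so that the composite has standard deviation sd_desired.

    The unadjusted composite has standard deviation R * sd_criterion, so every
    weight is multiplied by sd_desired / (R * sd_criterion).
    """
    factor = sd_desired / (multiple_r * sd_criterion)
    return [b * factor for b in b_weights]


def pair_sd(w1, w2, s1, s2, r12):
    """Standard deviation of a weighted sum of two components, formula (16.20)."""
    return math.sqrt(w1**2 * s1**2 + w2**2 * s2**2 + 2 * r12 * w1 * w2 * s1 * s2)


b = [0.224, 0.491]                                   # b weights for X4 and X5
same_spread = adjusted_weights(b, 0.578, 9.1, 9.1)   # equivalent to dividing by R
t_scale = adjusted_weights(b, 0.578, 9.1, 10.0)      # sigma of 10, as on a T scale

sd_same = pair_sd(same_spread[0], same_spread[1], 19.4, 3.7, 0.345)
sd_t = pair_sd(t_scale[0], t_scale[1], 19.4, 3.7, 0.345)
print(round(sd_same, 1), round(sd_t, 1))
```

Since every weight is multiplied by the same factor, the ratios among the weights, and hence the correlation of the composite with the criterion, are left unchanged.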

Achieving Any Desired Mean for a Composite. In the complete regression 
equation, in order to make the mean of the predictions equal that of the 
obtained values, the a coefficient is introduced. The computation of a is 
given by formula (16.15). After one has determined any weights whatever 
to apply to the raw scores of the components of a composite measure, the 
same formula can be applied, putting in the place of $M_1$ any desired quantity.
This is true because of the reasoning involved in the computation of the mean
of a composite [see formula (16.16)]. Thus, if we had wanted the mean of 
the grades predicted by the regression equation on p. 410 to be 50, we would 
have substituted 50 instead of 73.8, the actual mean of the grades. The only 
practical restriction would be to choose a mean such that no composite meas- 
ures would be negative. This means that any chosen mean should be at 
least 2.5 to 3.0 times the standard deviation of the composite. 
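As a small illustration, the constant needed for any desired mean can be computed as follows. This Python sketch assumes that formula (16.15), which is not reproduced in this section, has the usual form in which a equals the desired mean minus the weighted sum of the component means; the function name `additive_constant` is hypothetical.

```python
def additive_constant(desired_mean, weights, means):
    """Constant to add so that the weighted composite has the desired mean."""
    return desired_mean - sum(w * m for w, m in zip(weights, means))


b_weights = [0.182, 0.198, 0.142, 0.395]  # b weights for X2, X3, X4, X5
means = [19.7, 49.5, 61.1, 29.7]          # means of X2, X3, X4, X5
composite_sd = 6.35                       # R times the criterion sigma, from the text

a_actual = additive_constant(73.8, b_weights, means)  # mean of obtained grades
a_fifty = additive_constant(50.0, b_weights, means)   # an arbitrary desired mean

# Practical restriction from the text: the chosen mean should be at least
# 2.5 to 3.0 standard deviations, so that negative composites are unlikely.
mean_is_safe = 50.0 >= 3.0 * composite_sd

print(round(a_actual, 1), round(a_fifty, 1), mean_is_safe)
```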

Substitutes for Regression Weights. While regression weights derived 
from least-square solutions, or weights proportional to them, yield the greatest 
accuracy of prediction from the variables available, it is often expedient 
in the practical situation to deviate from the refined solution. It can be 
shown that we may substitute weights that approximate the regression coeffi- 
cients, even very roughly at times, and still not affect the degree of correlation 
very much. Instead of applying weights to three decimal places, one signifi- 
cant digit will often suffice, in other words, simple integral weights. 

In predicting freshman grades from high-school average and interest score
combined, for example, we found the optimal weights to be .224 and .491.
We might in practice round these to .2 and .5, respectively. It will be shown
later1 that the change in correlation between $X'_1$ and $X_1$ in the two cases is
from .578, with the three-digit weights, to .577, with the one-digit weights.
Surely, this loss is quite trivial. We could use weights of 2 and 5 had we so
chosen. Suppose we want even a simpler ratio of the two weights, like 1/2,
rather than 2/5. With weights of 1 and 2, also, the correlation of composites
and grades would be .577. With equal weights the correlation would drop to
.570. Even this much loss could be tolerated.

Before the reader draws the conclusion from this isolated example that all 
differential weighting is unnecessary, however (many generalizations, unfortu- 
nately, are just as sweeping as this would be), it is necessary to consider some 
points not yet brought out. There is no reason to believe that this is a 
typical example. Ordinarily, the more independent variables in a composite, 
the more can one depart from the weights demanded by least-square solutions 
and yet maintain a high level of correlation between that composite and a 
criterion to which the weights apply. This is why with a test of many items 
we may forget to bother with differential weighting. In a two-variable com- 
posite, however, we have the minimum number. We should therefore expect 
to find the validity of the composite to be rather sensitive to changes in 
weights. 

Roughly, the explanation in this example is that $X_4$ (high-school average)
has a beta weight about 2.4 times that for $X_5$ (interest score) and it has a
standard deviation about five times as large as that for $X_5$. Even when $X_4$
and $X_5$ have the same weight in the composite, $X_4$ contributes to the composite
in proportion to its standard deviation. This follows from equation
(16.17) in which it is shown that without differential weights each part's contribution
to total variance is proportional to its own variance. Without differential
weighting factors in the equation, then, $X_4$ is still weighted much more
than $X_5$. This illustrates a fact that is not often realized. It is usually
assumed that merely summing several scores weights those scores equally.
As a rule, it does not; it weights them in proportion to their standard deviations.
In more common-sense language, tests weight themselves.

1 Methods for correlating composites or sums, either weighted or unweighted, will be
described beginning on p. 425.

Weighting Measures Inversely as Their Standard Deviations. This discus- 
sion leads to the conclusion that if we really want to weight tests in a battery 
equally we should apply to each one a weight inversely proportional to its 
standard deviation. Without information as to the validities of the tests and 
of their intercorrelations, that would be a reasonable thing to do. It is some- 
times done. Table 16.10 shows how this end may be achieved. The four 
tests are the same as those used to predict freshman grades. The means and 
standard deviations are duplicates of those given in Table 16.1. 


TABLE 16.10. THE PROCESS OF WEIGHTING COMPONENTS INVERSELY AS
THEIR DISPERSIONS

                                         Variables
                                    A       B       C       D
M                                 19.7    49.5    61.1    29.7
σ                                  5.2    17.0    19.4     3.7
19.4/σ (w)                        3.73    1.14    1.00    5.24
Integral weight                      4       1       1       5
Estimated importance (I)             2       2       5       1
Combined weight (Iw)              7.46    2.28    5.00    5.24
Revised integral weight              7       2       5       5
Simplified weight (Iw/2.28)          3       1       2       2


We could find a weight equal to 1/$\sigma$ for every test, but these weights would
be rather small decimal numbers in some cases. A good practical procedure
is to select the largest $\sigma$ in the list, in this case 19.4, and to compute the ratio
19.4/$\sigma$ for each test. The test with the largest $\sigma$ will have the smallest weight.
With this particular ratio, the smallest weight will then be exactly 1.0. The
ratio of any other weights to this one will be immediately apparent. It is
recommended that all these ratios be rounded to the nearest integer, as shown
in the fourth row of Table 16.10. The weights obtained by this process are
4, 1, 1, and 5, respectively. With these weights applied, all four tests would
contribute approximately the same amount of variance to the total variance.
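The rows of Table 16.10 that deal with inverse weighting can be regenerated in a few lines of Python. This is only a sketch of the procedure just described; the variable names are hypothetical, and the column labels A to D follow the table.

```python
sds = {"A": 5.2, "B": 17.0, "C": 19.4, "D": 3.7}   # standard deviations of the tests

largest = max(sds.values())                         # 19.4 in this battery
ratios = {name: largest / sd for name, sd in sds.items()}
integral = {name: round(ratio) for name, ratio in ratios.items()}

# After weighting, each test's dispersion (weight times sigma) is roughly equal.
weighted_sds = {name: integral[name] * sds[name] for name in sds}

print({name: round(ratio, 2) for name, ratio in ratios.items()})
print(integral)
print({name: round(value, 1) for name, value in weighted_sds.items()})
```

The last line shows how nearly the weighted dispersions agree; the agreement is only approximate because the ratios have been rounded to integers.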

The principle of weighting each test inversely as its dispersion is involved
in the b coefficient. Remember that b is equal to beta times $\sigma_1/\sigma_j$, where $\sigma_j$
is the standard deviation of the test to be weighted. Using this procedure,
therefore, is virtually equivalent to using an incomplete b coefficient. It
virtually assumes equal validities for all tests and equal intercorrelations,
conditions which would lead to equal betas.

From the solution in Table 16.10, measures $X_4$ and $X_5$ should receive
weights of 1 and 5, respectively. The difference is in the same direction as
for the two b weights, which are .224 and .491, respectively, but $X_4$ is given
relatively about half as much importance as it should have. The effect upon
the correlation of the composite, weighted this way, is to reduce it from the
optimal R of .578 to a correlation of .558. The underweighting of $X_4$, which
is more valid and has a larger beta than $X_5$, shows up in the lower validity of
this composite.

Other Principles of Weighting. Common sense may suggest that component 
tests should be weighted in proportion to their lengths or their means or other 
obvious properties. To do so may lead the uninformed investigator astray. 
If two tests of unequal length are equally effective, in the sense that they pro- 
duce dispersions in proportion to their lengths, when no weights are applied 
at all they are automatically weighted in proportion to their lengths. Attach- 
ing more weight to the long test thus merely exaggerates an effect we already 
have. There is no real justification for weighting tests in proportion to their 
means, and, when means are proportional to standard deviations, the policy 
would again carry the weighting further in the same direction. 

If parts are regarded as really of equal importance, then a correction such 
as was described above would be in order. If the traits measured by different 
tests are regarded as differing in importance, and if we can decide upon ratios 
of importance, we can combine weights based upon these ratios with whatever 
weights we already have. Suppose, for example, we thought that the four 
variables in Table 16.10 are important in the ratios 2, 2, 5, and 1. Two 
weights for a variable are combined by finding their product. The
four products in Table 16.10 are 7.46, 2.28, 5.00, and 5.24, respectively.
Rounding these, we have 7, 2, 5, and 5. To simplify these still more, if we
let the smallest weight equal 1, the others can be expressed as integral multiples
of 1 (found by dividing every product by 2.28). The simplified, combined
weights are 3, 1, 2, and 2, as shown in the last row of Table 16.10.


Some investigators believe it important to consider reliabilities of measures 
in weighting them in combinations. By reliability here is meant consistency 
of scores as indicated by some kind of a self-correlation. If regression weights 
have been computed, reliabilities have been automatically taken into account 
and no modification of the weights for reliability would be necessary. But if
some other method is used to arrive at weights and if the measures combined
differ markedly in reliability, then some index of reliability should be con- 
sidered. This tends to avoid giving “errors of measurement" in the less 
reliable instruments too much weight. If reliability coefficients have been 
computed, the weight contributed from this source should be the square root 
of each reliability coefficient, rather than the reliability coefficient itself. 
The type of reliability coefficient should be one indicating internal con- 
sistency, i. e., an odd-even type or a Kuder-Richardson type (see Chap. 17). 

The Correlation of Composite Measures with Other Measures. The 
multiple R is only one index of correlation between a composite measure and 
some other measure. It applies to a composite in which the weighting has 
been optimal, with weights determined by the least-square solution. To test 
the predictive value for composites with other than optimal weights, we have 
other procedures known under the heading of correlation of sums. The components
may be unweighted (i. e., each weight is +1) or differentially weighted.

Correlation of a Composite of Unweighted Measures. The simplest case is
solved by the equation1

$$r_{cs} = \frac{r_{c1}\sigma_1 + r_{c2}\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2}}$$ (Correlation of a sum of two unweighted components with a third variable) (16.23)

where $\sigma_1$ and $\sigma_2$ = standard deviations of the two components and $r_{c1}$ and
$r_{c2}$ = correlation of each component with the third variable.

Let the illustrative summation equation be $X_s = X_4 + X_5$, where $X_s$
stands for a sum of $X_4$ and $X_5$, which in recent illustrations have stood for
high-school average and interest scores, respectively. What is the correlation
of $X_s$ with freshman grades, which here are symbolized by $X_c$? Applying
formula (16.23),

$$r_{cs} = \frac{(.546)(19.4) + (.365)(3.7)}{\sqrt{19.4^2 + 3.7^2 + 2(.345)(19.4)(3.7)}} = .570$$
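The substitution in formula (16.23) can be confirmed with a short Python sketch. The function name `corr_unweighted_sum` is hypothetical; the validities .546 and .365, the standard deviations 19.4 and 3.7, and the intercorrelation .345 are the values used in the text.

```python
import math


def corr_unweighted_sum(r_c1, r_c2, s1, s2, r12):
    """Formula (16.23): correlation of X1 + X2 with an outside variable."""
    numerator = r_c1 * s1 + r_c2 * s2
    denominator = math.sqrt(s1**2 + s2**2 + 2 * r12 * s1 * s2)
    return numerator / denominator


r_cs = corr_unweighted_sum(0.546, 0.365, 19.4, 3.7, 0.345)
print(round(r_cs, 3))
```

The denominator is simply the standard deviation of the unweighted sum, 21.0, found earlier from formula (16.17).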


When there are more than two components, the more general formula for
the same kind of correlation is

$$r_{cs} = \frac{\Sigma r_{ci}\sigma_i}{\sqrt{\Sigma\sigma_i^2 + 2\Sigma r_{ij}\sigma_i\sigma_j}}$$ (Correlation between a sum of unweighted variables and another single variable) (16.24)

where $r_{ci}$ = correlation between any one component $X_i$ and the outside single
variable (i varies from 1 to n)
$\sigma_i$ = standard deviation of the same component
$r_{ij}$ = correlation between $X_i$ and any other component $X_j$, when j is a
higher subscript number than i*

1 See Appendix A for proof.

* Here, as in similar formulas, $\Sigma r_{ij}\sigma_i\sigma_j$ implies covariances of all possible pairs of variables.

Correlation of a Composite of Weighted Measures. When there are two
components, each weighted differently, the correlation with a third measure
is given by1

$$r_{c(ws)} = \frac{w_1r_{c1}\sigma_1 + w_2r_{c2}\sigma_2}{\sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2}}$$ (Correlation of a sum of two weighted measures with a third measure) (16.25)

where $w_1$ and $w_2$ = weights attached to measures $X_1$ and $X_2$, respectively,
and other symbols are as defined in formula (16.23).

For the combination of high-school average and interest scores, let us 
assume weights of 2 and 5, respectively. These are closely proportional to 
the b coefficients of .224 and .491, respectively. Applying formula (16.25), 


$$r_{c(ws)} = \frac{2(.546)(19.4) + 5(.365)(3.7)}{\sqrt{4(19.4^2) + 25(3.7^2) + 2(.345)(2)(19.4)(5)(3.7)}} = .577$$


Thus, crude, integral weights of 2 and 5 would give as high a correlation of
the combination of high-school average and interest scores with freshman
grades as would the three-digit b coefficients .224 and .491.

For the general case, with more than two components, the correlation with
an outside variable is

$$r_{c(ws)} = \frac{\sum w_i r_{ci}\sigma_i}{\sqrt{\sum w_i^2\sigma_i^2 + 2\sum r_{ij}w_iw_j\sigma_i\sigma_j}} \qquad (16.26)$$

(Correlation of a weighted sum with an outside variable)

where the symbols are as defined in preceding formulas.
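
The general formula (16.26) lends itself to a short routine. The sketch below is one possible Python rendering, not a procedure given in the text; the function name `corr_weighted_sum` and the list-based layout of the inputs are assumptions made for illustration. With weights of 2 and 5 it should reproduce the weighted example, and with unit weights it reduces to the unweighted case.

```python
import math


def corr_weighted_sum(weights, validities, sds, intercorr):
    """Correlation of a weighted sum of components with an outside variable.

    weights:    weight w_i applied to each component
    validities: correlation r_ci of each component with the outside variable
    sds:        standard deviation of each component
    intercorr:  square matrix of correlations r_ij among the components
    """
    n = len(weights)
    numerator = sum(w * r * s for w, r, s in zip(weights, validities, sds))
    # Variance of the weighted sum: weighted variances plus twice the
    # weighted covariances of all pairs with j > i.
    variance = sum((w * s) ** 2 for w, s in zip(weights, sds))
    for i in range(n):
        for j in range(i + 1, n):
            variance += (2 * intercorr[i][j]
                         * weights[i] * weights[j] * sds[i] * sds[j])
    return numerator / math.sqrt(variance)


# Values quoted in the worked examples of this section.
validities = [0.546, 0.365]
sds = [19.4, 3.7]
intercorr = [[1.0, 0.345], [0.345, 1.0]]

weighted = corr_weighted_sum([2, 5], validities, sds, intercorr)
unweighted = corr_weighted_sum([1, 1], validities, sds, intercorr)
print(round(weighted, 3), round(unweighted, 3))
```

Multiplying all weights by the same positive constant leaves the correlation unchanged, which is why only the proportions among the weights matter.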


ALTERNATIVE SUMMARIZING METHODS 


Summative equations represent only one way in which several measures 
may be combined in order to reach single predictions or decisions. There 
are alternative methods, some of which are better than regression equations 
in certain situations. The two chief contenders are the multiple-cutoff 
method and the profile method. These will be described and their variations 
discussed. 

Multiple-cutoff Methods. In a multiple-cutoff method, a minimum quali- 
fying score or measure is adopted for each variable used in making a joint 
prediction. A good example of the method is the medical examination in the 
qualification of individuals for military service, for life insurance, or for 
employment. Failure to meet the standard on any one test may disqualify 
the individual. Making a particularly good showing in one respect is not 
ordinarily allowed to compensate for a poor showing in some other. The
phenomenon of compensation, which the regression-equation approach
allows, is the chief difference between the two methods, in principle.

¹ For proof, see Appendix A.

Multiple Cutoffs Contrasted with Multiple Regression. A geometric illustration
of the difference between the two methods may be seen in Fig. 16.4. The
two variables represented there ($X_2$ and $X_3$) are both independent variables,
used jointly to predict some criterion $X_1$, which is not shown. A moderate
correlation, of approximately .40, is assumed between $X_2$ and $X_3$, as represented
by the familiar elliptical distribution of the population. Let us assume
a selection problem and that we have the alternative of applying two cutoff
scores, one on $X_2$ and one on $X_3$, or of applying a single cutoff score based
upon a weighted sum of $X_2$ and $X_3$. Assume also that we reject the same
proportion of the applicants by either method.

Fig. 16.4. Geometric comparison of accepted and rejected personnel by the multiple-
regression-equation method and by the multiple-cutoff method, when approximately equal
proportions are selected by either method. (After R. L. Thorndike, AAF Report No. 3.)

The use of two cutoff scores would reject all individuals to the left of the
cutoff point on the horizontal axis and a vertical line erected at that point,
and also all individuals below the cutoff point on the vertical axis and a
horizontal line drawn at that level. Some individuals
would be rejected on the basis of either variable alone and some on the basis
of failure to meet standards on both. The single cutoff on the weighted
composite, however, would be represented by a slanted line. This is consistent
with the slanted-line system shown in Fig. 16.2. All individuals below and
to the left of this slanted line would be rejected.

It is now possible to see what kind of individuals would be accepted by the
one method and rejected by the other and on which ones the two methods
agree. The individuals in area A of the ellipse would be accepted by either
method. The individuals in area R would be rejected by either method.
Individuals in area B would be rejected by the multiple-regression-equation
method but would be accepted by the multiple-cutoff method. Individuals
in areas C and D would be accepted by the regression method but rejected by
the cutoff method, those in C for different reasons than those in D.

The crux of the comparison of values of the two methods lies in determining 
whether individuals in area B are any better in the criterion than those in 
areas C and D. Individuals in area B are rejected by the one method because
they combine below-average scores in $X_2$ and $X_3$. They just succeed in meeting
minimum standards in both variables and so would be accepted by the
other method. Individuals in areas C and D, although below standards in 
one variable, are allowed to present compensating strong scores in the other 
variable and hence to be accepted by the one method. They are regarded as 
doubtful risks by the other method. 
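
The contrast between the two decision rules can be sketched in a few lines of Python. All numbers here (cutoffs of 40 on each variable, unit weights, a composite cutoff of 100) and all function names are hypothetical, chosen only to illustrate the areas A, B, C, D, and R described above; they are not taken from Fig. 16.4.

```python
def multiple_cutoff_accepts(x2, x3, cut2, cut3):
    """Accept only if the minimum standard is met on BOTH variables."""
    return x2 >= cut2 and x3 >= cut3


def composite_accepts(x2, x3, w2, w3, cut_composite):
    """Accept if the weighted sum reaches a single cutoff (compensation allowed)."""
    return w2 * x2 + w3 * x3 >= cut_composite


def region(x2, x3, cut2=40, cut3=40, w2=1.0, w3=1.0, cut_composite=100):
    """Label an applicant by the area of the figure into which he would fall."""
    by_cutoff = multiple_cutoff_accepts(x2, x3, cut2, cut3)
    by_composite = composite_accepts(x2, x3, w2, w3, cut_composite)
    if by_cutoff and by_composite:
        return "A"        # accepted by either method
    if not by_cutoff and not by_composite:
        return "R"        # rejected by either method
    if by_cutoff:
        return "B"        # barely meets both minima; composite too low
    return "C or D"       # low on one variable, compensated by the other


for applicant in [(60, 60), (30, 30), (45, 45), (35, 80)]:
    print(applicant, region(*applicant))
```

The applicant scoring (45, 45) passes both minima but has a weak composite, while the applicant scoring (35, 80) fails one minimum but is carried by a strong score on the other variable.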

It can be argued that not enough is known about compensatory effects in 
performances that serve as criteria, and that is quite true. There should be 
some experimental studies of this kind. A vindication of the regression 
method, however, is found in the consistency with which composite scores 
continue to correlate as they do in line with multiple-correlation coefficients 
that forecast those correlations. If compensatory effects did not occur, there 
would probably be much more shrinkage in correlation of sums with criteria 
than there is. 

An Evaluation of the Multiple-cutoff Method. If all regressions are linear,
theoretically, there should be no advantage in selection by multiple cutoffs 
over that by composites. This can be explained roughly by the fact that in 
a linear regression there is a continuous improvement in criterion measures 
with increased score in an independent variable, and at a constant rate. 
Thus, so far as the relationship between the test and the criterion is con- 
cerned, there is no more reason for putting the cutoff at one point rather than 
another. The cutoff would have to be established on the basis of some other 
determiners, such as success ratio or validity. In using a number of tests for 
selection for a single purpose, presumably it would be best to make the most 
rejections on the basis of the most valid test. When a regression is definitely
curved, there is a real basis for using a cutoff on a single test. The cutoff 
would be established in line with the region of transition between low and 
high rates of increase in the criterion measure. For example, in Fig. 15.12, 
somewhere between the scores 90 and 100 would be a good division point, 
taking advantage of the rapid increase in criterion values as scores increase 
in X, and at the same time recognizing that above a score of 100 there are no 
appreciable differences in criterion values as X changes. 

There are some practical difficulties in the administration of multiple 
cutoffs which make the method less appealing than a regression equation. 
There is the difficulty of establishing several different cutoff points which will 
take full advantage of the differences in validity among the tests and which
will yield the appropriate numbers of qualified applicants. Once the mini- 
mum standards are established, however, the method is simple to apply. 
Failure to meet any one of the minimal scores automatically means rejection. 

Rejection of an applicant on the basis of a single test is somewhat risky as 
compared with rejection on the basis of a composite score, because of the fact 
that the reliability of a single test score is usually less than that for a com- 
posite. If the parts of a composite are positively intercorrelated, the total 
score is more reliable than the part scores. 

Some Variations of the Multiple-cutoff Method. A distinction is made 
between a simultaneous-hurdles method and a successive-hurdles procedure in 
testing programs using multiple cutoffs. In the former, all applicants take 
all tests; in the latter they do not—they continue to take tests only as long as 
they continue to qualify on them. After the first failure they are rejected.
In the latter method it is good practice to administer the most valid test first.
It is the one on which the largest number of rejections should be made. It is
desirable, too, that if a single attempt is to be decisive for so many individuals, 
the decision should be made on as good a basis as possible. If a test of very 
low validity were given first, some who could qualify on the valid test would 
never have a chance to take it. Such individuals might be expected to fail 
when they took the invalid test later, of course, but remember that tests are 
not perfectly reliable, and a person might pass a certain test on one day and 
fail it on another. The successive-hurdles method has the great practical 
advantage of saving in testing time. If there are many more applicants than 
openings, large numbers of applicants can be screened and eliminated from 
further testing by means of a single preliminary examination. 
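
The saving in testing time can be made concrete with a small sketch. The two Python functions below are illustrative assumptions (their names, the score lists, and the cutoffs of 50 are hypothetical); each returns the decision together with the number of tests the applicant actually had to take.

```python
def simultaneous_hurdles(scores, cutoffs):
    """Every applicant takes every test; any failure means rejection."""
    accepted = all(score >= cut for score, cut in zip(scores, cutoffs))
    return accepted, len(scores)


def successive_hurdles(scores, cutoffs):
    """Tests are taken in order; testing stops at the first failure."""
    tests_taken = 0
    for score, cut in zip(scores, cutoffs):
        tests_taken += 1
        if score < cut:
            return False, tests_taken
    return True, tests_taken


# Tests are listed with the most valid one first, as the text recommends.
cutoffs = [50, 50, 50]
print(simultaneous_hurdles([30, 70, 80], cutoffs))
print(successive_hurdles([30, 70, 80], cutoffs))
print(successive_hurdles([60, 70, 80], cutoffs))
```

Both procedures reach the same decision for a given set of scores; they differ only in how many tests are administered before a rejection.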

Other variations in using the multiple-cutoff principle have to do with rules 
concerning rejection. It is not necessary to base a rejection on one test alone. 
The rules might allow for failure on not more than two, or any selected num- 
ber of variables. The rules might be refined to the extent of considering 
pairs or triads of tests. Rejection might be reserved for those who fail on test 
M only if they also fail on test N, and so on. Such refinement, however, 
must be based upon good evidence that it pays in terms of better selection. 
For most purposes such evidence is lacking. 

Profile Methods. For guidance work and clinical work in general there is 
common preference for seeing an individual’s scores represented in a pattern 
provided in a profile. A single summative score is unsuitable or may be
unobtainable. A single composite score is unsuitable perhaps because the problem
is not a selection problem but a classification problem. In vocational
guidance, clients are “sorted” into vocational categories. If there were single
summative scores already established with satisfactory correlations with
vocational criteria of many kinds, perhaps the profile method would be less
important. Clinicians commonly express a desire to “see a personality in its
totality,” however, and a profile is one approach to this end.

¹ Toops, H. A. Philosophy and practice of personnel selection. Educ. psychol. Measmt.,
1945, 5, 95-124.

There are several ways of using profiles. Some prefer the intuition given
by a general impression of a plotted graph for an individual. Others prefer
to match more definitely described job-requirement or adjustment-requirement
patterns with individual trait patterns. It is possible, by means of
careful research, to define certain adjustment requirements in terms of optimal
scores in a number of different variables. This statement implies curved
regressions, and that is precisely the condition which favors the choice of a
profile method over a regression method.

Fig. 16.5. An illustration of the profile method of selection applied to personality-inventory
scores. The clear portion of the chart represents what is believed, on the basis of experience,
to be the most favorable score ranges for personnel who are assigned to a certain
routine type of work. The scores of the worker shown all fell within the favorable region.
(Courtesy of R. P. Kreuter, Hand Knit Hosiery Company, Sheboygan, Wis.)

Figure 16.5 demonstrates this kind of use of a profile. By experience, it 
was found that female workers in a certain kind of routine task tended to be 
most suited to the job if they had scores in certain regions on the 13 traits 
scored in the Guilford-Martin personality inventories. Such workers were 
likely to be best if somewhat shy or reclusive, a little on the depressed and 
emotional side, less active than average (the task was sedentary), less 
ascendant socially, somewhat beset with feelings of inferiority, somewhat sub- 
jective or hypersensitive, and perhaps none too agreeable or cooperative. In 
most respects the tendencies listed would seem to present a generally “poor” 
personality picture. Low extremes were unfavorable, however; the general 
tendency was just average or slightly below in most traits. This is understandable
in that such an individual is probably lacking in aspirations for
positions that require the better qualities and is contented with a routine 
type of work in which adjustments to social requirements are relatively easy. 
The profile is shown of a certain individual who was rated very high in per- 
formance at her task. 

For selection purposes, a profile may be handled in various ways. The one 
shown in Fig. 16.5 illustrates one procedure. The favorable zone is clear, and 
less favorable zones are crosshatched. The crosshatching can be overprinted 
on the chart or a plastic mask can be prepared to lay over individual charts. 
Decisions can be based upon the number of favorable scores or upon the trend 
of the individual’s curve as compared with the trend of the optimal scores. 
If a single optimal score has been determined for every trait, and an “ideal” 
profile has been drawn, the departure of a single profile from the ideal profile 
can be determined in various ways, none of them highly satisfactory. The 
deviations of each person’s scores from the ideal scores can be summarized in 
various ways. A way that meets common statistical principles would be to 
square the deviation, sum the squares, find a mean, and then a square root. 
This would give a single summarizing statistic that has some statistical 
sanction. There are many who would want more than such a number, however,
for it does not tell us where the deviations are.
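
The summarizing statistic just described (square the deviations, sum, take the mean, then the square root) can be sketched as follows. The function name and the two trait profiles are hypothetical illustrations, not data from Fig. 16.5.

```python
import math


def profile_rms_deviation(scores, ideal):
    """Root-mean-square departure of one profile from an ideal profile."""
    if len(scores) != len(ideal):
        raise ValueError("profiles must cover the same traits")
    squared = [(s - i) ** 2 for s, i in zip(scores, ideal)]
    return math.sqrt(sum(squared) / len(squared))


ideal = [5, 5, 5, 5]
person_a = [7, 3, 7, 3]   # alternates above and below the ideal
person_b = [7, 7, 3, 3]   # high on the first two traits, low on the last two

print(profile_rms_deviation(person_a, ideal))
print(profile_rms_deviation(person_b, ideal))
```

The two hypothetical persons obtain the same value although their profiles differ in shape, which illustrates the limitation noted above: the single number does not show where the deviations lie.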

Classification of Personnel. Selection of personnel presupposes a supply 
of applicants and the possibility of rejecting a proportion of them. Attention 
is upon one kind of assignment to be filled. In the classification of personnel, 
there are two or more assignments that can be made and one might even con- 
sider rejecting none, provided proper assignments can be found for all. In 
some situations there is the double problem of selection and classification 
combined. The availability of more than one assignment, however, makes 
possible the utilization of many more applicants than would be true if there 
were only one kind of place to fill, for, presumably, personnel who do not 
qualify for one place might well qualify for some other. The more different 
kinds of places there are to fill, the smaller the chance of any applicant’s 
being rejected for every kind. 

Classification, broadly defined, means assigning individuals each to his 
most appropriate category. This would include the operations in educational 
and vocational guidance. In vocational guidance, the number of kinds of 
“assignments” is almost infinite, though the number of major categories is 
limited. In selection we have an assignment with the need to find the person 
for it; in classification in general, we have a number of assignments with their 
requirements in terms of human resources, on the one hand, and a number of 
persons who have the resources to satisfy or not to satisfy each assignment on 
the other. In vocational guidance, we have one individual, with a unique 
pattern of resources, on the one hand, and a large variety of possible occupa- 
tions on the other. 

As demonstrated in this and in preceding chapters, we have solved many
of the statistical problems involved in selection of personnel. These are
bound up with the problems of prediction and of how to evaluate the goodness
of prediction. By contrast, the problems of classification have been solved
more slowly. Assignment to alternative classes requires a differential
prediction, rather than a prediction on a single variable. We have to predict how
much better the individual will adjust or perform if assigned to one category 
than if assigned to some other category. 

When only two assignments are being considered and two predictive indices, 
we attempt to predict a difference in the criterion variable (or between 
criterion variables) from a difference in the assessment variable (or between 
assessment variables). It is reasonable that the more independence between
two criterion variables (the less they intercorrelate), the more easily we can
make a differential prediction. The more easily, also, could we find relatively
independent assessment variables. Lack of correlation between both
the criterion measures and the assessment measures seems to be very important
for effective classification.¹

Classification through Selection. Whether we have two or whether we have 
more than two alternative categories in which to place individuals, an 
approximate solution lies in the application of selection procedures. For 
each vocational category to be filled, we can derive a multiple-regression 
equation, where the criterion to be predicted is a measure of success in that 
vocation. The differences between composite scores would be the deciding
factor in classification. If possible, each person would be assigned to that 
category for which he has the highest composite score. Profile methods could 
also be used. With an optimal profile developed for each category, and a 
method of comparing the extent to which an individual's profile approaches 
different profiles, decisions could be reached. 
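
A sketch of classification through selection, assuming a separate regression equation has already been derived for each category. The category names, weights, and intercepts below are entirely hypothetical and serve only to show the decision rule of assigning each person to the category with the highest composite score.

```python
def composite_scores(scores, equations):
    """Predicted criterion value for each category.

    equations maps a category name to (weights, intercept), i.e. one
    multiple-regression equation per vocational category.
    """
    return {
        category: intercept + sum(w * x for w, x in zip(weights, scores))
        for category, (weights, intercept) in equations.items()
    }


def classify(scores, equations):
    """Assign the person to the category with the highest composite score."""
    composites = composite_scores(scores, equations)
    return max(composites, key=composites.get)


# Hypothetical regression equations for two vocational categories.
equations = {
    "sales": ([0.5, 0.2], 1.0),
    "pilot": ([0.1, 0.6], 0.5),
}

print(classify([10, 4], equations))
print(classify([2, 12], equations))
```

Comparing composites across categories in this way presupposes that the several criteria are expressed on comparable scales; that is an assumption of the sketch, not something the rule itself guarantees.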

Use of the Discriminant Function in Classification. A better procedure,
that introduces more directly the principle of differential prediction, is use of
the discriminant function. This is another statistic that was originated by
Fisher. The general principle is that the different scores or measures will be 
weighted in such a way as to maximize the difference between the means of 
two composites derived from two criterion groups, relative to the variance 
within those groups. Suppose that we have two groups of successful indi- 
viduals in two vocations—selling life insurance and piloting airplanes. We 
also have scores from all individuals in the two groups from several tests. 
We want to weight the tests (with the same weights applying to both groups) 
so that the means of the composite scores would differ as much as possible. 
The overlapping of the two distributions of composite scores would then be
as small as possible. The result would be that an F ratio or a t ratio would
be a maximum.

¹ These problems have been discussed at greater length by Thorndike, R. L. Personnel
Selection. New York: Wiley, 1949; and Brogden, H. E. An approach to the problem of
differential prediction. Psychometrika, 1946, 11, 139-154.
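
The maximizing principle can be illustrated for the simplest case of two tests and two groups. The sketch below uses the standard two-group solution, in which the weights are proportional to the inverse of the pooled within-groups sums of squares and cross products multiplied by the vector of mean differences. The function names and the eight pairs of scores are hypothetical; the text itself gives no computing routine.

```python
def mean(values):
    return sum(values) / len(values)


def column_means(group):
    return [mean([row[k] for row in group]) for k in range(2)]


def pooled_within(group_a, group_b):
    """2 x 2 pooled within-groups sums of squares and cross products."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for group in (group_a, group_b):
        m = column_means(group)
        for row in group:
            d = [row[0] - m[0], row[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s


def discriminant_weights(group_a, group_b):
    """Weights proportional to (pooled within matrix)^-1 times mean differences."""
    s = pooled_within(group_a, group_b)
    ma, mb = column_means(group_a), column_means(group_b)
    d = [ma[0] - mb[0], ma[1] - mb[1]]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    w1 = (s[1][1] * d[0] - s[0][1] * d[1]) / det
    w2 = (-s[1][0] * d[0] + s[0][0] * d[1]) / det
    return w1, w2


def separation(weights, group_a, group_b):
    """Squared difference of composite means over pooled within-groups SS."""
    ca = [weights[0] * x1 + weights[1] * x2 for x1, x2 in group_a]
    cb = [weights[0] * x1 + weights[1] * x2 for x1, x2 in group_b]
    between = (mean(ca) - mean(cb)) ** 2
    within = (sum((c - mean(ca)) ** 2 for c in ca)
              + sum((c - mean(cb)) ** 2 for c in cb))
    return between / within


# Hypothetical scores on two tests for two criterion groups.
group_a = [(52, 30), (60, 34), (57, 41), (63, 37)]
group_b = [(48, 44), (55, 50), (50, 47), (59, 55)]

best = discriminant_weights(group_a, group_b)
print(best, separation(best, group_a, group_b))
for other in [(1, 0), (0, 1), (1, 1), (1, -1)]:
    print(other, separation(other, group_a, group_b))
```

The quantity returned by `separation` is proportional to the square of the t ratio between the two groups on the composite, so the weights that maximize it also maximize the t ratio (and the corresponding F ratio).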

We can approach the problem from the correlation point of view if we look
at it in a different way. If we assign the criterion values of 1 and 0 to the
two groups (which group is 1 and which is 0 does not matter), and if we treat
the group differentiation as a genuine dichotomy, we have a multiple-point-biserial
problem, as demonstrated by Wherry.¹ That is, the dichotomy is a
criterion to be predicted by means of a multiple-regression equation, in which
the components are optimally weighted. The information with which we
start would be a point-biserial r between each measure and the criterion and a
Pearson product-moment r (preferred) among the measures of assessment.
The procedure for determining the weights in the regression equation would
be the same as illustrated in this chapter. The SD of the criterion would be
$\sqrt{pq}$, where p = proportion in one of the groups and q = 1 − p. A
multiple-point-biserial R can also be computed to indicate the goodness of
prediction afforded by this equation.
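
That a criterion scored 1 and 0 has a standard deviation equal to the square root of pq can be verified numerically. The sketch below is illustrative only; the group sizes of 30 and 70 and the function names are hypothetical.

```python
import math


def sd_of_dichotomy(p):
    """Standard deviation of a 1-0 variable with proportion p scored 1."""
    q = 1 - p
    return math.sqrt(p * q)


def population_sd(values):
    """Standard deviation computed directly, dividing by N."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


# Hypothetical criterion groups: 30 persons scored 1 and 70 scored 0.
criterion = [1] * 30 + [0] * 70
p = sum(criterion) / len(criterion)

print(sd_of_dichotomy(p), population_sd(criterion))
```

The two values agree, and interchanging which group is scored 1 leaves the result unchanged, since pq is symmetric in p and q.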

The cutoff point to apply to the composite scores so as to make the best 
classification of individuals (smallest probability of error of classification) 
would conform to the procedures described in Chap. 14. In fact, the predic- 
tion of category from measurements in that chapter is based upon the same 
principles as those involved in the discriminant function. 

When there are more than two classes to be predicted, the multiple-regres- 
sion problem becomes quite complicated. There have been a number of 
attempts to solve the problem, of which one by Horst is a good example.²

DATA 16A. INTERCORRELATIONS OF SCORES FROM FOUR EXAMINATIONS AND MARKS
RECEIVED IN FRESHMAN MATHEMATICS
(N = 100)


Exercises

In connection with each exercise, state your conclusions and interpretations.

1. Using information obtained from Data 16A, derive a regression equation involving
X; (dependent variable) with X; and X. Compute the multiple R and its standard error. 


¹ Wherry, R. J. Multiple bi-serial and multiple point bi-serial correlation. Psychometrika, 1947, 12, 189-195.
² Horst, P. A technique for the development of a differential prediction battery. Psychol. Monogr., 1954, 68, No. 380.


434 FUNDAMENTAL, STATISTICS IN PSYCHOLOGY AND EDUCATION len. 16 
Хх = Ohio State psychological examination. 


X, = engineering-aptitude examination.


Data 16A, with a multiple R and its SE.
4. Two students, A and B, have the following scores:




Estimate their most probable marks in freshman mathematics, using the regression equa- 
tions derived in Exercises 1, 2, and 3. 

5. Compute the standard errors of multiple estimate, coefficients of multiple deter- 
mination and multiple nondetermination, and indices of forecasting efficiency for the 
problems in Exercises 1 and 3. 

6. Compute SE's of the regression coefficients in Exercise 1 and the t ratios.

7. Apply the shrinkage formulas to the multiple R’s and the SE's of estimate in con- 
nection with Exercises 1 and 3. 

8. By the iterative method, solve for the optimal beta weights for all variables in
Data 16A. Compare them with the beta weights found by the Doolittle solution.

9. Estimate the means of the combinations of scores by the regression weights found 
in Exercises 1 and 3. 

10. Estimate the standard deviation of:

a. An unweighted combination of scores X; and X, in Data 16A.

b. A weighted combination of the same scores, using the regression weights found in
Exercise 1. Check by the product oN s-

c. A weighted combination of the same scores, using weights of 2 and 5, respectively.

11. Find the correlation of:

a. An unweighted combination of X, and X, with X..

b. A weighted combination of the same variables with XI, using weights of 2 and 5,
respectively.

Compare these correlations with the multiple Ri, se 


Answers 


1. Х = 328Х; + .505Х, + 1.64; Rin = 049; on = 059. 
2. Х| = 570Х, + 299X, + 1.12; Rem = Sehen 
3. Bin = 146; % = 096; йу, = 422, Bis = .187; 
Xie AM, + Mex, + ASIN, + 211Х, + 79; Rina = 674; % = ‚056. 
4. Xj (equation 1): 5.3; 6.8; P еі (equation 2): 6.1; 4.3; X, (equation 3): 5.3; 6.4. 
S. eis = LBA; nona = 1.63; Rs, = А21; R¹ = 454; Күз = .579; 
BY snes = 546; Erg = 239; By sues 22.6. 
6. eina = 091, og, = 092; 0, = MOS ene ** 098; fim 2.85; аз = 5.16. 
7. asm = 1.86; tron © 1.66; Rise = 639; Riss = ‚656, 
8. (Same as for Exercise 3.) 
9. Mus: 4.06; 4.91. 
10. (a) e, = 3.66; (b) e, = 1.57 (check: оК = 1.57); (с) e, = 13.73. 
M. (а) ree = .644; (6) raus) = OAS. 


CHAPTER 17 


RELIABILITY OF MEASUREMENTS 


The Importance of Reliability. Much of what was said in previous chap- 
ters assumed that measurements were perfectly reliable, or nearly so. By a 
perfectly reliable measurement we mean one that is completely stable or 
fixed. The same “yardstick” applied to the same individual or object 
should yield the same value from moment to moment, provided the thing 
measured has itself not changed in the meantime. 

There are times, both in theoretical investigations and in practical work, 
when it is very important to take into account the question of reliability. 
Although numbers, as such, are exact concepts, just because we amass a
series of numbers attached to individuals or to observations is no assurance 
that those numbers mean much at all about the things measured. 

There is no way of just looking at numbers and telling whether or not they 
stand for any real values or could possibly have been “pulled out of a hat.” 
Some samples of measurements actually approach the chance condition just 
implied. Others are not exactly “chance” collections of numbers, but there 
is a strong element of chance involved in them. 

Conclusions to be derived from the very same statistical results might differ 
considerably whether we know the measurements to be highly reliable or not. 
Differences and correlation coefficients may often prove to be insignificant 
merely because the measures used were lacking in reliability. Thus, the 
matter of reliability well merits considerable attention. 


RELIABILITY THEORY 


It is impossible to appreciate the many problems that arise in connection 
with reliability and the several meanings of the term itself without going into 
some of the mathematical ideas underlying the concept. The reader will 
find that on the one hand there is a rigorously defined conception of reliability
from which it is possible to understand many of the peculiarities of measure- 
ments, particularly those called test scores. On the other hand there are 
several operational conceptions of reliability, depending upon how it is esti- 
mated from empirical data—such as internal-consistency, test-retest, and 
alternate-forms methods. Keeping in mind the fact that there are several 
kinds of reliability and that operational definitions and logical definitions do 
not coincide will aid a great deal in thinking about problems of reliability. 
We shall begin with the basic, theoretical conceptions of reliability. 

The Basic Definition of Reliability. The reliability of any set of measurements
is logically defined as the proportion of their variance that is true variance.
Before elaborating upon the heart of this statement, which is the last part,
attention should be called to the more incidental part. The statement begins
with "the reliability of any set of measurements." Note that it is the
measurements that are said to have the property of reliability rather than the
measuring instrument. That is because in psychological and educational
measurement, and other social measurements, reliability depends upon the 
population measured as well as upon the measuring instrument. It can rarely 
be said of any instrument, test or other device, that the reliability of that 
device is of a certain value (usually in the form of a coefficient of correlation). 
Reliability is of a certain instrument applied to a certain population under certain 
conditions. 

The next comment on the definition, and a more important one, is in
definition of true variance. The idea of variance itself is not new. The total
variance, which we will now call $\sigma_t^2$, of a set of measurements is the mean of
the squares of deviations from the mean of the measurements. The idea
of separating total variance into components is also not new. That idea
was emphasized in the chapter on analysis of variance (Chap. 12) and in the
chapters on prediction of measurements (15 and 16). Here we make a new
kind of segregation of variances. We think of the total variance of a set of
measures as being made up of two sources or kinds of variance: true variance
and error variance. We think of each single measurement, also, as having
two components: a true measure and an error. In terms of an equation,


$$X_t = X_\infty + X_e \qquad (17.1)$$

(An obtained measure expressed as the sum of a true and an error component)

where $X_t$ = obtained score or measure
$X_\infty$ = true score or measure
$X_e$ = error increment or component

Several assumptions are made in connection with this equation. The true
measure is assumed to be the genuine value of the thing measured, a value we
should obtain if we had a perfect instrument applied under ideal conditions. 
Another conception is that it is the mean value we should obtain for the 
object if we measured it a very large number of times. There is no incon- 
sistency between these two conceptions. Any obtained measurement at a 
particular moment is determined in part by the true value and in part by 
conditions which bring about a departure, perhaps, from that value. 

In measuring a series of objects, it is assumed that the error components 
occur independently and at random, that their mean is zero (they increase 
as often as they decrease a measurement), and that they are uncorrelated 
with the true values and with errors in other measurements. The assumption 
that the mean of the errors is zero is not essential but it is convenient. These 
conditions may not always be satisfied. Without evidence to the contrary
we assume that they are satisfied. Knowledge of the instrument and of the
conditions of measurement is sometimes sufficient to lend support to
these assumptions or to cause us to reject them in any particular situation.

Reliability was defined as the proportion of the total variance that is true
variance. The three variances, true, error, and total, are illustrated in
Table 17.1. There we have a set of 10 hypothetical, true measures whose
mean is 25 and whose variance is 105.0. For each true measure we have a
corresponding error component that is to be added to it to form a total, or
obtained, measure for the individual. The mean of these error components
is zero, as assumed above. Their variance is equal to 15.2.

TABLE 17.1. DISPERSION OF TRUE MEASURES, ERROR COMPONENTS, AND THEIR SUMS,
THE TOTAL MEASURES, WITH MEANS, VARIANCES, AND STANDARD DEVIATIONS

            True measure    Error component    Total measure
                 ...              ...               ...
                  25              ...                25
                  25              ...                35
                  30              ...                26
                  35              ...                33
                  45              ...                45
  Σ              250              ...               250
  M             25.0              0.0              25.0
  Σx²           1050              ...              1202
  σ²           105.0             15.2             120.2
  σ             10.2              ...              11.0

The variance of the total measures can be estimated from the component 
variances by using formula (16.17) of the preceding chapter. It is merely 
the sum of the two component variances. In the new symbols, 


$$\sigma_t^2 = \sigma_\infty^2 + \sigma_e^2 \qquad (17.2)$$

(A total variance as the sum of true and error variances)


The application of this equation in Table 17.1 gives a total variance of 120.2,
which checks with that computed from the sum of squares of $X_t$.

In satisfaction of the definition of reliability, we need to find the proportion
of total variance that is true variance. If we divide equation (17.2) through
by $\sigma_t^2$ we have proportions:

$$\frac{\sigma_\infty^2}{\sigma_t^2} + \frac{\sigma_e^2}{\sigma_t^2} = 1.00 \qquad (17.3)$$

(Sum of proportions of true and error variance)


In symbolic form, the reliability of these measurements is given by the ratio
$\sigma_\infty^2/\sigma_t^2$, or in another form by $1 - \sigma_e^2/\sigma_t^2$. In other words, the reliability is
measured by the ratio of true variance to total variance, or by one minus the
ratio of error variance to total variance. Letting $r_{tt}$ stand for the coefficient
of reliability, we have two alternative equations:


$$r_{tt} = \frac{\sigma_\infty^2}{\sigma_t^2} \quad\text{and}\quad r_{tt} = 1 - \frac{\sigma_e^2}{\sigma_t^2} \qquad (17.4)$$

(Basic equations for the coefficient of reliability)

For the data of Table 17.1,

$$r_{tt} = \frac{105.0}{120.2} = .87 \quad\text{or}\quad r_{tt} = 1 - \frac{15.2}{120.2} = .87$$
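
The two forms of the reliability coefficient can be checked against the variances quoted for Table 17.1. This is a minimal Python sketch; the variable names are illustrative assumptions, and only the three variances stated in the text (105.0, 15.2, and 120.2) are used.

```python
# Variances quoted in the text for Table 17.1.
true_variance = 105.0
error_variance = 15.2

# With uncorrelated errors, the total variance is the sum of the two parts.
total_variance = true_variance + error_variance

# Reliability as the proportion of total variance that is true variance,
# and equivalently as one minus the proportion that is error variance.
r_from_true = true_variance / total_variance
r_from_error = 1 - error_variance / total_variance

# Proportion of error variance in the total.
error_proportion = error_variance / total_variance

print(round(total_variance, 1), round(r_from_true, 2), round(r_from_error, 2))
```

Both routes give .87 to two decimals, and the reliability and the error proportion together account for the whole of the total variance.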


If we let $e^2$ stand for the proportion of error variance in the total, we have
the equation

$$r_{tt} + e^2 = 1.00 \qquad (17.5)$$

(Complementary nature of proportions of true and error variance)


The previous relationships are demonstrated pictorially in Fig. 17.1 and 
Fig. 17.2. In Fig. 17.1 dispersions of true measures and of total measures are 
shown. Both have the same mean. The standard deviation $\sigma_t$ is greater
than $\sigma_\infty$. This is always true, unless they happen to be equal. The effect
of errors of measurement is always to increase obtained dispersions, never to
decrease them, unless they should happen to be correlated with the true
measures or with each other.

Incidentally, this suggests that standard errors of means and other statistics
which are estimated from obtained σ's, are inflated values when measures are
at all unreliable. Tests of significance are therefore reduced in power by
unreliability. The only remedy is to improve reliability of measures or to
increase the size of sample to compensate for errors of measurement. There
are no known corrections to apply, nor could they probably be justified.

Figure 17.2 presents the picture in a somewhat different manner. Here
the summative properties of variances are apparent. Without the assumption
of zero correlations for the errors, such a simple picture would be impossible.
This kind of representation of variances, in tests particularly, will be
encountered with increasing frequency in this and the next chapter.

The Index of Reliability. The reliability coefficient for a test, $r_{tt}$, as
described thus far, is merely an abstract idea. Operationally, it is some kind
of self-correlation of a test.


Fig. 17.1. Distribution of obtained scores in a test (solid curve) and of the hypothetical
true components of those scores (dotted curve). Means of obtained and true scores
coincide, on the assumption that errors of measurement have a mean of zero. The standard
deviation of the obtained scores is larger than that of the true components.


[Figure 17.2: two bars, labeled "Amounts of variances" and "Proportions of variances", each divided into a true portion and an error portion.]

Fig. 17.2. Amounts of true and error variance (first bar) in a test; also proportions of true
and error variance (second bar).


Before we go into the various operations for estimating $r_{tt}$, let us add more
fundamental meaning to the idea of reliability. Let us think of the true score
($X_\infty$) and the obtained score ($X_t$) as being two separate variables, the one
dependent upon or predictable from the other. This is in spite of the fact
that the one includes the other. Think of $X_t$ as the dependent variable and
of $X_\infty$ as the independent variable. In a real sense, $X_t$ is determined by or
dependent upon $X_\infty$. Figure 17.3 shows these two variables as coordinates
and the line of regression of $X_t$ upon $X_\infty$. The correlation between the two,
which is known as the index of reliability, is $r_{t\infty}$. The square of this
correlation coefficient is an index of determination (see Chap. 15) and it indicates
the proportion of variance in $X_t$ that is determined by variance in $X_\infty$. But
this is precisely what the reliability coefficient ($r_{tt}$) tells us. Consequently,
we have shown that

$$r_{t\infty}^2 = r_{tt} \qquad (17.6)$$
and
$$r_{t\infty} = \sqrt{r_{tt}} \qquad (17.7)$$
(Relation of an index of reliability to a coefficient of reliability)


Nothing can correlate with obtained scores higher than their correlation
with corresponding true scores. The statistic $r_{t\infty}$, then, is often used as an
indication of the upper limit of correlation of any variable with another.


[Figure 17.3: scatter diagram with true score ($X_\infty$) on the horizontal axis and obtained score ($X_t$) on the vertical axis, showing the regression line and the range within which 2/3 of obtained scores fall, bounded by $\sigma_{t\infty}$ (standard error of an obtained score).]

Fig. 17.3. Regression of obtained scores on true scores, with parallel lines drawn at vertical
distances of one standard error ($\sigma_{t\infty}$) from the regression line. (Compare this illustration
with that in Fig. 15.6. The standard error of measurement is essentially a standard error
of estimate.)


Since $r_{t\infty}$ is the square root of the reliability coefficient, it is always
numerically higher than $r_{tt}$. Do not be surprised, then, to find that a test may
correlate higher with another test than it correlates with itself. We cannot
compute $r_{t\infty}$ directly from data, but it can be estimated from $r_{tt}$ or from other
information. It is a seldom used statistic, but it has a definite meaning and
could be used along with $r_{tt}$ or in place of it.
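Although true scores are never observed in practice, the relation between the index of reliability and the reliability coefficient can be illustrated with artificial data in which the true scores are known by construction. The following is a minimal sketch under stated assumptions: normally distributed true scores and errors, the component variances of the example above (105.0 and 15.2), an arbitrary mean of 50, and two hypothetical administrations whose errors are independent. None of these simulated figures comes from the text.

```python
import math
import random

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(17)          # fixed seed so the sketch is repeatable
n_people = 20000                 # hypothetical sample size
true_sd = math.sqrt(105.0)       # true variance from the worked example
error_sd = math.sqrt(15.2)       # error variance from the worked example

true_scores = [rng.gauss(50.0, true_sd) for _ in range(n_people)]
# Two administrations share the true score but have independent errors.
obtained_1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
obtained_2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]

r_tt = pearson_r(obtained_1, obtained_2)      # self-correlation of the test
r_t_inf = pearson_r(obtained_1, true_scores)  # index of reliability

print(round(r_tt, 2), round(r_t_inf, 2), round(math.sqrt(r_tt), 2))
```

Within sampling error, the correlation of obtained with true scores should agree with the square root of the self-correlation, and both should lie near the values implied by the variance ratio 105.0/120.2.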

The Standard Error of Measurement. Since we can estimate the correlation
between obtained and true scores and can think in terms of prediction of
one from the other, we can also ask concerning the errors of prediction. We
know the obtained scores and from them could predict true scores (assuming
any mean and standard deviation we please for the true-score scale). But
there is nothing to be gained by so doing, for the predictions would be no
more accurate than the scores from which they were obtained. Nothing
would have happened except a change of unit and zero point.

Suppose that we think in terms of prediction in the other direction, from
true scores to obtained scores. This is impossible, practically, since we do not
know the true scores from which to make predictions. Let us think rather in
terms of determination; of true scores determining obtained scores. But
errors of measurement also help to determine obtained scores. We are
interested in the extent of the discrepancies caused by these errors of
measurement, in other words, in the size of distortions produced in the otherwise
true-determined measurements. The average of these discrepancies is estimated
by the formula


$$\sigma_{t\infty} = \sigma_t \sqrt{1 - r_{tt}} \qquad (17.8)$$
(Standard error of measurement)

where $\sigma_t$ = standard deviation of the distribution of obtained scores and
$r_{tt}$ = reliability coefficient.

The standard error of measurement is a standard error of estimate and may
be interpreted as such.¹ Figure 17.3 shows the limits marked off at distances
of plus and minus 1 $\sigma_{t\infty}$ from the regression line. In a certain test with a $\sigma_{t\infty}$
equal to 2.0 units, we may say that two-thirds of the obtained scores are
within 2.0 units of the true scores that determined them. If a certain
individual's true score were 35, for example, the odds are 2 to 1 that his obtained
score would not exceed 37 or fall below 33. Allowing a margin of 2σ, we can
say that the odds are 19 to 1 that his obtained score will not exceed 39 or fall
below 31.
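Equation (17.8) and the score limits just described can be sketched as follows. The function names are our own, introduced only for illustration. The first computation uses the variances of the earlier example (total 120.2, error 15.2); the second uses the standard error of 2.0 and the true score of 35 mentioned above.

```python
import math

def standard_error_of_measurement(sd_obtained, r_tt):
    """Equation (17.8): sigma_t_inf = sigma_t * sqrt(1 - r_tt)."""
    return sd_obtained * math.sqrt(1.0 - r_tt)

def score_band(true_score, sem, multiple=1.0):
    """Limits lying `multiple` standard errors on either side of a true score."""
    return (true_score - multiple * sem, true_score + multiple * sem)

# Worked figures from the text: total variance 120.2, error variance 15.2.
sd_t = math.sqrt(120.2)
r_tt = 1 - 15.2 / 120.2
sem_table = standard_error_of_measurement(sd_t, r_tt)
print(round(sem_table, 2))  # 3.9, i.e. the square root of the error variance

# The text's example: a standard error of 2.0 and a true score of 35.
print(score_band(35, 2.0))              # (33.0, 37.0)
print(score_band(35, 2.0, multiple=2))  # (31.0, 39.0)
```

The first result shows that the formula merely recovers the standard deviation of the error components, which is what the standard error of measurement is intended to estimate.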

Any obtained score does not tell us what the corresponding true score is, 
but with knowledge of the $\sigma_{t\infty}$ we have a degree of confidence that the true
score cannot be very far away. The same standard error gives us some basis 
for confidence as to whether the scores for two persons represent a real differ- 
ence or whether we can tolerate the idea that they could have come from the 
same true score. 

Reliability at Different Parts of the Test Scale. Test users frequently ask to
know the standard error of measurement rather than the reliability coeffi- 
cient, because it tells them more directly what they wish to know. It tells 
them whether they should be concerned about differences of 2, 4, 8, or 12 
points or whether any or all of these differences are within the probable range 
that could have been produced by errors of measurement. 

It may happen, however, that because of a peculiarity of the test itself,
discriminations are better at one part of the scale than at other parts. The
$\sigma_{t\infty}$ statistic is a blanket index, implying approximately equal discriminating
power all along the scale. If there is reason to suspect that discrimination is
actually unequal along the scale, this can be examined by preparing a scatter
diagram, showing the relationship between two forms (or halves) of the same
test. The standard deviations of the columns or rows at different score levels
will indicate where predictions have the greatest accuracy. If the score
distribution approaches normality and if obtained scores do not extend over the
entire possible range, the standard error of measurement is probably uniform
at all score levels.

1 This statistic is also called the standard error of an obtained score.

Computing the Standard Error of Measurement from Differences. Rulon has
devised a way of computing $\sigma_{t\infty}$ directly from differences between scores made
by individuals on odd and even pools of items.¹ The equation is

$$\sigma_{t\infty} = \sqrt{\frac{\Sigma d^2}{N}} \qquad (17.9)$$
(Standard error of measurement computed from differences)


where $d$ = difference between two scores of half tests for one individual. A
rough rationale for the Rulon method is to say that a difference between one
half score and the other half score for the same person is a measure of the
error for that individual. Since errors are conceived as deviations, squaring,
summing, and dividing by N should estimate the amount of error variance.
That is precisely what $\sigma_{t\infty}^2$ signifies: the amount of error variance. Thus,
$\sigma_d^2 = \sigma_e^2 = \sigma_{t\infty}^2$. This fact will be used later as another way of
estimating the reliability coefficient.
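The computation from differences, as described in the preceding paragraph (squaring the differences, summing, dividing by N, and taking the square root), may be sketched as below. The half-test scores are hypothetical figures invented for the illustration, and the function name is our own.

```python
import math

def sem_from_differences(odd_scores, even_scores):
    """Square root of sum(d^2) / N, where d = odd-half minus even-half score."""
    diffs = [o - e for o, e in zip(odd_scores, even_scores)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# Hypothetical half-test scores for six examinees (not from the text).
odd_half = [12, 15, 9, 18, 14, 11]
even_half = [11, 16, 9, 16, 15, 10]

sem = sem_from_differences(odd_half, even_half)
print(round(sem, 3))  # 1.155
```

Here the six differences are 1, -1, 0, 2, -1, and 1, whose squares sum to 8, so that the estimated error variance is 8/6 and its square root about 1.15.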


METHODS OF ESTIMATING RELIABILITY 


We leave theory for a while and see how $r_{tt}$ can be estimated from empirical
data. There are many procedures, falling roughly into three categories:
(1) internal-consistency reliability, or simply internal consistency; (2)
alternate-forms reliability, or comparable-forms reliability, or parallel-forms
reliability; and (3) retest reliability, or test-retest reliability. Cronbach has
recently proposed that we speak of the second and third types of estimate as
coefficients of equivalence and of stability, respectively.² It would be
convenient, also, to speak of the first type as a coefficient of consistency.

There is no one best way of estimating $r_{tt}$. The type preferred will depend
upon one’s purposes and the meaning and use one wishes to attach to $r_{tt}$. A
secondary consideration is availability of data in the proper form. Other 
considerations have to do with testing conditions and the kind of test or other 
measure. 

The various procedures differ most in the kinds of things that are allowed 
to be considered as true variance and as error variance. What may be
regarded as true variance in computing one kind of $r_{tt}$ may be regarded as
error variance in computing one of the others. For the sake of clear thinking,
it will pay us to look at some examples of this.

1 Rulon, P. J. A simplified procedure for determining the reliability of a test by split- 
halves. Harv. educ. Rev., 1939, 9, 99-103. 


2 Cronbach, L. J. Test "reliability": its meaning and determination. Psychometrika,
1947, 12, 1-16.

Contributors to True and Error Variances. On the whole, things that contribute
to an examinee’s making the same score in “repeated” applications of
a test are contributors to true variance in the obtained scores. The word
a test are contributors to true variance in the obtained scores. The word 
“repeated” is in quotation marks here because the repetition is broadly 
defined to include alternate forms or two halves of the same test. On the 
whole, things that contribute to varying evaluations of performance of an 
individual in a test are contributors to error variance. The sources of true 
and error variances are numerous. Certain of them are of sufficient clarity 
and commonness of appearance to be recognized and named. 

Let the bar diagram in Fig. 17.4 represent the total variance in obtained
scores of a test. Let $c^2$ be that proportion of the total variance that would
be regarded as true variance no matter what method of estimating $r_{tt}$ is
employed. After all, the methods should have very much in common. Let $e_a^2$ be
regarded as those sources of error variance that are unique to the alternate-
forms method but are regarded as sources of true variance for the other

[Figure 17.4: bar diagram of total-score variance, with brackets marking the portions counted under internal-consistency reliability, retest reliability, and alternate-forms reliability.]

Fig. 17.4. Proportions of the total-score variance that can be regarded as true variance
or as error variance, depending upon which type of reliability estimate is made.


methods. The relative sizes of these portions will vary from test to test.
Actual examples of $e_a^2$ and of $c^2$ will be given shortly. Let $e_i^2$ be sources of
error variance particularly when some internal-consistency method is used.
This portion is also represented as providing determiners of errors for the
retest method. Finally, let $e_r^2$ be more distinctly the source of error when the
retest method is applied, but as being a source of true variance for the other
methods. The actual situation is probably not so simple as this, but it is
hoped that this much simplicity will contribute to clear conceptions.

Now for some illustrations of actual determiners of the different kinds of
variance. These determiners, it must be remembered, are thought of as 
contributing to individual differences between scores, either within a single 
application of a test or between applications or between forms. Among the 
determiners of individual differences that are consistent from time to time 
and from one form of a test to another is individual status in some enduring 
ability, skill, or other trait or traits. These are the things that we wish to 
measure. Incidental determiners that also belong under portion $c^2$ in the
diagram (Fig. 17.4) are general skill in taking tests, skill in taking this
particular kind of test, including the form of item used, and possibly the ability
to understand test instructions. These additional sources of variance are
only potential. For any given test, the task may require so little
understanding or the type of item may be so well known to all examinees that they
are practically on a par with respect to these determiners and they conse- 
quently would not contribute to individual differences in scores. If they do 
operate to affect variances, however, they would produce effects in the same 
directions in odd and even scores, and, in so far as individuals do not change 
in these respects from one administration to another, they would contribute 
to true variance in all three types of reliability estimate. 

Determiners that contribute to error variance in the retest method include 
temporary conditions, either of the examinee or of the testing environment, 
including the examiner. The examinee's state of health, fatigue, boredom, 
emotional condition, and the like may well change from one day to another. 
Environmental conditions can vary considerably without affecting scores 
materially, but, in so far as they do, such factors as temperature, humidity, 
lighting, audibility of instructions or signals, ventilation, and the like may 
differ enough to contribute to error variance. 

There are probably more important changes in the examinee himself. 
Having taken a certain test, he is not the same individual when faced with 
the second attempt. The skills and knowledge acquired during the first 
administration and in the interval between will have their effects upon the 
second performance. Memory for answers given on the first occasion may 
lead to repetitions of the same answers the second time and thus contribute 
to apparent true variance. Awareness of mistakes made in the first attempt, 
however, leads to changes in responses and hence to error variance. Besides 
possible improvement during the taking of the test the first time there is possi- 
ble improvement resulting from transfer effects occurring during the interval 
between administrations. There are also possible maturational factors, 
particularly in young children. If learning and maturational effects were 
uniform for all individuals, or in proportion to their initial positions in the 
distribution, these determiners would not contribute to error variance. But 
to the extent that learning and maturational effects differ from person to 
person, they do add much to error variance. 

The longer the time interval between test administrations, the greater the
error contributions. In some tests, continuous loss in reliability occurs as a
function of time interval between test and retest. In some psychomotor
tests, self-correlations of .90 to .96 may be found by the odd-even method,
but test-retest correlations with a year interval between may give correlations
of approximately .70. Results of this kind were found in testing aviation
cadets in the AAF before training and again after aircrew training and
perhaps some combat.

Error variance in the alternate-forms method is contributed chiefly by the
change in content of the test. Knowledge and skill for dealing with one
particular set of items may vary somewhat from the knowledge and skill for
dealing with another set of items, and these variations differ from person to
person. In addition, depending upon the time interval between administrations
of the two forms, some of the determiners of error variance just mentioned
for the retest method may also apply to the alternate-forms method.
An experiment in the AAF¹ in which the two forms were given in immediate
succession and also with 4 hours of other testing intervening showed no
appreciable change in the size of the self-correlation. Longer periods might
well be expected to have some effect.

If the odd-even technique is used in the split-half method, the changes in
conditions that may occur during a single administration of a test are rather 
uniformly distributed over all items in both halves so that their effects would 
not show up as error variance. There are other ways of splitting tests into 
halves, however, which may allow more error variance to creep in. If the 
test is divided by blocks of items, as in odd and even half pages, or odd and 
even 2-min. trials, or first half against second half, there is room for sys- 
tematic shifting of conditions. The effects of learning, of temporary changes 
in mental set (as for speed versus accuracy or as to mode of attack on the 
items), or of fatigue or motivation, then might contribute to error variance. 
These are represented in section $e_i^2$ in Fig. 17.4.

The determiners of error that would affect all methods of reliability
estimate alike, represented in Fig. 17.4 by the error portion common to all three
methods, are such phenomena as fluctuations of attention or memory or of
motivation that occur from moment to moment or from
item to item. In some tests, guessing is an important contributor to error 
variance. If a test is so difficult that everyone does considerable guessing 
(in the extreme case assume that every examinee guessed on every item) the 
total scores for all examinees approach chance distributions whose variances 
are very largely error variance. If guessing is a feature in any test, the more 
difficult the test, the lower its reliability is likely to be. On the other hand,
if the test is too easy, the dispersion of scores is reduced and the reliability
is lowered with it. The smaller the number of alternative responses, the greater is
the importance of the guessing feature. True-false tests of the same material 
are less reliable than are four-choice tests, and these, in turn, less reliable than 
tests of the completion form, other things being equal. The moral of this, 
of course, is to avoid items with too small a number of alternative responses 
or to compensate for the greater chance element by making the test longer. 
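The extreme case mentioned above, in which every examinee guesses on every item, can be given a rough numerical form. The sketch below assumes blind and independent guesses, so that total scores follow a binomial distribution; the test length of 100 items is a hypothetical figure, and the function name is our own.

```python
def chance_score_stats(n_items, n_alternatives):
    """Mean and variance of total scores if every answer is a blind guess."""
    p = 1.0 / n_alternatives          # chance of guessing one item correctly
    return n_items * p, n_items * p * (1.0 - p)

for label, k in [("true-false", 2), ("four-choice", 4)]:
    mean, variance = chance_score_stats(100, k)
    print(label, mean, variance)
# true-false 50.0 25.0
# four-choice 25.0 18.75
```

Under these assumptions the whole of such variance is error variance, and both the chance mean and the chance variance are larger for the two-choice form than for the four-choice form of the same length.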

When Different Methods of Estimating $r_{tt}$ Are Preferred. Preference for
one of the three types of reliability estimate depends mostly upon two con- 
siderations: type of test and meaning of the statistic, or purpose for which it 
will be used. 

Homogeneous versus Heterogeneous Tests. Psychological tests can be
divided roughly into two classes: homogeneous and heterogeneous. The
former are functionally uniform or, strictly speaking, factorially unique.
They measure one factor, i.e., one ability or trait. Very few tests satisfy this
definition completely. Some examples are vocabulary, numerical-operations,
and perceptual-speed tests. The great majority of tests are factorially
complex. Each one measures at the same time a number of different abilities or
traits.

1 Guilford, J. P. (ed.) Printed classification tests, in AAF Aviation Psychology Research
Program Reports, No. 5. Washington, D. C.: GPO, 1947. Pp. 257.

So far as reliability is concerned, other tests may be considered homogeneous
if the items are similar in factorial content. That is, if the test as a whole
measures abilities P, Q, and R, and if each and every item also measures those
three abilities, for operational purposes the test may be regarded as
functionally homogeneous. An example of this would be an arithmetic-reasoning
test or a figure-analogies test.

We expect that homogeneous tests shall be internally consistent—we want 
all parts to measure the same thing, or things; consequently, some form of 
internal-consistency index is called for, unless the speed element is appreciable 
(many examinees do not complete the test). 

If a test is heterogeneous, in the sense that different parts measure different 
traits, we should not expect a very high index of internal consistency. An 
example of such a test is a biographical-data inventory. This kind of test is 
composed of questions concerning the examinee's previous life and experi- 
ences. Each response to every item is usually validated by correlating it 
with some practical criterion, for example, success in pilot training. The 
reason one response is valid is not necessarily the same as the reason another 
is valid. They may both predict the criterion and yet correlate zero with 
each other. The parts of such a test, one randomly chosen half and another, 
will probably not correlate very high with each other. The test has low 
internal consistency. An $r_{tt}$ computed in this manner would not do justice
to the test. Neither would an alternate-forms $r_{tt}$, if the forms were developed
independently.

The only meaningful estimate of reliability for a heterogeneous test is of the 
retest variety. If, by chance, a heterogeneous test were developed, each 
item of which correlated with a criterion and yet did not correlate with any 
other item, the internal-consistency reliability would be zero. Yet, the retest 
reliability might be substantial or high. A biographical-data test of the type 
referred to above had a characteristic split-half reliability coefficient of about 
.35 and a retest reliability of about .65. Both of these values are unusually
low, but the test had a validity close to .40 for the selection of pilots and
consequently was very useful.

It is clear from the discussion above that the internal consistency and the 
stability of the same test need not agree very closely. There can be very low 
internal consistency and yet substantial or high retest reliability. It is 
probably not true, however, that there can be high internal consistency and
at the same time low retest reliability, except after very long time intervals.
High internal-consistency reliability is in itself assurance that we are dealing
with a homogeneous test, at least within the broad meaning of the term stated
above.

Speed Tests and Power Tests. Tests are also sometimes roughly categorized 
as speed tests and power tests. There is no sharp line of demarcation. A 
genuine power test is one that all examinees have time to finish. It is 
intended that every examinee shall attempt every item. Achievement 
examinations are in this category. Speed tests are those in which there is a 
time limit such that not all examinees can attempt all items. In this category 
are tests ranging all the way from those in which no one attempts all items to 
those in which 99 per cent may do so. The latter are so close to the power 
type that many examiners would be inclined to place them in the power 
category. As a general (rough) criterion, we may say that a power test is 
finished by at least 75 per cent of the examinees. 

It would be out of the question to use the odd-even method of self-correla- 
tion with a highly speeded test. If no examinee finished and if there were no 
errors, the correlation of halves would be +1.00 which would have no meaning 
except that the scorer had counted the numbers of reactions in the two halves 
correctly. If first and last halves were used, assuming everyone finished the 
first half and there were almost no errors, all scores for the first half would be 
about the same and those for the last half would depend upon the rate of 
work. The correlation would be near zero, for lack of dispersion of the
first-half scores.
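The odd-even case can be illustrated with a small artificial example. The sketch assumes hypothetical examinees who answer the first k items of a speeded test, all correctly, with k running from 20 to 60; these figures are ours and not from the text. Because half scores are whole numbers, the odd and even scores of an examinee may differ by one item, so the computed coefficient falls just short of the +1.00 of the idealized case.

```python
import math

def pearson_r(xs, ys):
    """Product-moment correlation of two equally long lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical numbers of items reached by 41 examinees, none making errors.
items_attempted = list(range(20, 61))
odd_scores = [(k + 1) // 2 for k in items_attempted]   # items 1, 3, 5, ...
even_scores = [k // 2 for k in items_attempted]        # items 2, 4, 6, ...

r_odd_even = pearson_r(odd_scores, even_scores)
print(round(r_odd_even, 3))
```

The coefficient is nearly perfect although nothing has been learned about consistency of performance: it reflects only the fact that both halves count the same amount of work done.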

In fact, any internal-consistency estimate of $r_{tt}$ would be misapplied to a
speed test. The errors caricatured above are present to some degree no 
matter which one of the internal-consistency methods we apply. A retest 
method will be adequate for many speed tests, except where there is identity 
of items and hence learning and memory are sources of variance, both true 
and error, in unknown proportions. For most speed tests, and this includes 
those in which any appreciable number of examinees fail to reach the last 
item, an alternate-forms type of reliability estimate is probably best. 

A good device to use in the development of new tests is to prepare two 
equivalent halves and to administer them in immediate succession as two 
separately timed tests. The correlation between the two halves, independ- 
ently administered, can be treated as we treat the correlation of any other 
half scores by the Spearman-Brown formula in order to estimate the reliability 
of the full-length test. The comparability of the halves can usually be 
accomplished by careful construction. Some check upon the adequacy of 
the efforts is in the comparability of means, standard deviations, and skewness 
of the two distributions. 
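The step from the correlation of two halves to the reliability of the full-length test may be sketched as follows. The formula used is the standard Spearman-Brown formula, which for a test of doubled length is 2r / (1 + r); the half-test correlation of .70 is a hypothetical value chosen for the illustration, and the function name is our own.

```python
def spearman_brown(r_part, length_factor=2.0):
    """Estimated reliability of a test `length_factor` times as long as the part."""
    return length_factor * r_part / (1.0 + (length_factor - 1.0) * r_part)

# Hypothetical correlation between two separately timed halves.
r_halves = 0.70
print(round(spearman_brown(r_halves), 3))  # 0.824
```

With the default factor of 2, the function reduces to the doubled-length case that applies to half scores.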

Meaning and Use of the Indices of Reliability. The retest method yields
information about the stability of rank orders of individuals over a period of
time. A high $r_{tt}$ from this source indicates that persons change very little in
status within their population from the first to the second testing; also that
the test measures the same functions before and after the interval. A low $r_{tt}$
of this type may mean that individuals have changed in different directions or
in the same direction at different rates. Changes of means and of standard 
deviations will help to interpret the kinds of systematic changes taking place. 
Plots of scatter diagrams may show whether systematic changes are uniform 
over the range. These changes we call function fluctuations of individuals. 
If the test measures something different after an interval than before, we have 
a function fluctuation of the test. These changes can be examined by means of 
correlations of the test with other tests before and after the interval. 

There may be some practical reasons for knowing the stability of scores
over periods of time and, if so, the retest $r_{tt}$ is the index to use. Usually, the
length of time is a factor to be considered. The chief use of this information
is in deciding whether to depend upon scores that were obtained in an earlier 
testing or to administer the same test or a new form to obtain some scores 
that better describe the individuals right now. As a general policy it would 
be desirable to establish the principles regarding what kinds of tests yield 
stable scores, with what kinds of populations, and over what periods of time, 
and what kinds of tests do not.

The meaning of internal consistency was covered in a superficial way in the
discussion of homogeneous tests. We shall go more thoroughly into the 
matter shortly in treating the specific methods under this category. This 
concept probably comes closest to the basic idea of reliability. The methods 
make an estimate of reliability from a single administration of a single test 
form. The estimate is of an “on-the-spot” reliability. It tells us something 
of how closely the obtained score comes to the score the person would have 
made at this particular time if we had had a perfect measuring instrument. 
For some purposes this information will certainly not be sufficient. It is the 
kind of reliability that does have meaning in connection with factorial 
descriptions of tests. These descriptions (see Chap. 18) attempt to depict a 
test in terms of its component variances, some of which combine to make up 
its true variance. The internal-consistency estimate tells us nothing about
function stability of persons or of tests.

The alternate-forms estimate of $r_{tt}$ tells us something about function
stability in variations of the same test or in different items that have been
designed to measure the same functions. It indicates how independent the
measurements are of the particular items or content used. If the two forms
happen to be two halves of the same test, then presumably the kind of items
is the same in both (verbal, numerical, pictorial; matching, multiple-choice,
completion); only the specific problems change. The alternate-forms $r_{tt}$
may tend to be slightly lower than the internal-consistency $r_{tt}$, but this may
mean that it gives a more realistic picture of how accurately the test measures
the general traits, ruling out whatever variance is dependent upon the
particular content of one form of the test. The two estimates will be almost
identical, probably, in power tests of very closely matched content. In
power tests, then, the two methods could be used almost interchangeably.
In speed tests, as indicated before, the alternate-forms method is the most
justifiable approach to reliability estimate.


INTERNAL-CONSISTENCY RELIABILITY 


There are several operations by which an internal-consistency estimate of 
reliability may be made, and there is so much basic test theory bound up with 
them that we need to give this approach special attention. First, we shall 
consider some more theory. 

The Statistical Nature of a Test Composed of Items. Most tests are com- 
posed of items. Most tests are scored by giving credit of +1 for a correct 
response to each item and a weight of 0 for each wrong answer or omission. 
The theory about to be explained assumes that kind of test. Furthermore, 
it applies best to a power test, in which omissions and wrong answers proba- 
bly mean inability to master the item. For the time being we shall not be 
concerned with the problem of chance success by guessing. We might 
assume completion items in which chance factors resulting from guessing are 
almost nil. The theory will probably apply to situations deviating appreci- 
ably from these specifications, enough so that the many conclusions to which 
it leads will have quite general application. 

Item Statistics. It is convenient to think of each item as a subtest in a
larger composite. Each item, then, yields a distribution of scores, with a
mean and a standard deviation. According to an earlier discussion of
proportions (see Table 9.3), the mean of such a distribution, where the measures
are either 0 or 1, is equal to $p$, the proportion of all who attempt the item who
get the right answer. The variance of the distribution is equal to $pq$, where
$q = 1 - p$, and the standard deviation is $\sqrt{pq}$.
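That the variance of a 0-or-1 item equals pq can be confirmed by direct computation. In the sketch below the ten responses are hypothetical and the function name is our own; the variance is computed by dividing by N, in keeping with the usage of the text.

```python
import math

def item_statistics(responses):
    """Mean p, variance pq, and standard deviation for a 0/1-scored item."""
    n = len(responses)
    p = sum(responses) / n
    q = 1.0 - p
    return p, p * q, math.sqrt(p * q)

# Hypothetical responses of ten examinees to one item (1 = right, 0 = wrong).
item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p, variance, sd = item_statistics(item)

# Direct computation of the variance, to confirm that it equals pq.
direct_variance = sum((x - p) ** 2 for x in item) / len(item)
print(p, round(variance, 2), round(direct_variance, 2))  # 0.7 0.21 0.21
```

With seven of the ten examinees passing, p is .7, q is .3, and both routes give a variance of .21.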

The total score on such a test is the sum of part scores. In equation form, 

$$X_t = X_a + X_b + X_c + \cdots + X_n \qquad (17.10)$$
(The sum of item scores to make a total test score)

where $X_a, X_b, \ldots, X_n$ = scores in items $a, b, \ldots, n$ when there are $n$
items in the test.

The variance of the total test score can be derived from the variances and 
covariances of the items, according to the principles brought out in the pre- 
ceding chapter in connection with the variance of sums. Equation (16.19) 
applied to this particular use would read 


$$\sigma_t^2 = p_a q_a + p_b q_b + p_c q_c + \cdots + p_n q_n
    + 2 r_{ab} \sqrt{p_a q_a p_b q_b} + 2 r_{ac} \sqrt{p_a q_a p_c q_c} + \cdots
    + 2 r_{(n-1)n} \sqrt{p_{n-1} q_{n-1} p_n q_n}$$    (17.11)

(Total test variance as summation of item variances and covariances)

where $p_a, p_b, \ldots, p_n$ = proportions passing items $a, b, \ldots, n$
      $q_a, q_b, \ldots, q_n$ = $1 - p_a, 1 - p_b, \ldots, 1 - p_n$
      $r_{ab}, r_{ac}, \ldots, r_{(n-1)n}$ = intercorrelations of items

In abbreviated, summational form, the equation reads 

$$\sigma_t^2 = \sum p_i q_i + 2 \sum r_{ij} \sqrt{p_i q_i p_j q_j}$$    (17.12)

where $p_i$ = $p_a, p_b, \ldots, p_n$ in turn and $r_{ij}$ = correlation between item $i$ and
item $j$, where subscript $j$ is numerically greater than $i$.
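
The decomposition of total variance into item variances and covariances can be checked numerically. The sketch below is only an illustration: the response matrix is hypothetical, the helper names are not from the text, and variances are taken with N (not N - 1) in the denominator, which is an assumption about the convention in use.

```python
from itertools import combinations
from math import sqrt

# Hypothetical responses: rows are examinees, columns are items (1 = right, 0 = wrong).
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
]

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance (divide by N)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    """Product-moment r; for two 0/1 items this is the phi coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])
    return cov / sqrt(variance(xs) * variance(ys))

items = list(zip(*responses))        # one tuple of 0/1 scores per item
p = [mean(item) for item in items]   # proportion passing each item
pq = [pi * (1 - pi) for pi in p]     # item variances, p times q

# First term: sum of item variances.
sum_item_variances = sum(pq)

# Second term: twice the sum of r_ij * sqrt(p_i q_i p_j q_j) over pairs with j > i.
sum_covariances = sum(
    2 * correlation(items[i], items[j]) * sqrt(pq[i] * pq[j])
    for i, j in combinations(range(len(items)), 2)
)

totals = [sum(row) for row in responses]
total_variance = variance(totals)
print(round(total_variance, 4), round(sum_item_variances + sum_covariances, 4))
```

The two printed values agree, since each covariance term is simply the covariance of a pair of items counted twice.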

Deductions Derived from the Item-variance Equations. There are many 
useful and enlightening inferences that can be deduced from the equation just 
given. We shall consider only the most important ones here. 

Relation of Variance to Item Difficulty. The first thing to be noted is the
relation of variance to item difficulty. Remembering that variance means
individual differences and that the greater the variance, the more we have
dispersed individuals in measurement, it can be stated that the item that
will produce the greatest dispersion is of median difficulty. It is an item
passed by half of the group and failed by half of the group. When
$p = q = .5$, the $pq$ product is at a maximum. As $p$ approaches 0 or 1 the
variance decreases toward the vanishing point. This has a common-sense
explanation. Let us suppose an item that 1 person out of 100 can answer
correctly. This item discriminates 1 person from each of 99, or makes 99
discriminations. Then, suppose an item that can be passed by 2 out of 100.
This item makes 2 × 98 discriminations, or 196. Continue this to 50, and we
get 2,500 discriminations, each one of the 50 who pass it from each one, in
turn, of the 50 who fail it. Items of moderate difficulty, then, yield the
maximum variance.
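
The counting argument can be tabulated directly. This is a minimal sketch; the function names are illustrative and not part of the text.

```python
def item_variance(p):
    """Variance of a 0/1 item passed by proportion p (that is, p times q)."""
    return p * (1 - p)

def discriminations(passing, group_size):
    """Number of pass/fail pairs the item separates."""
    return passing * (group_size - passing)

# Number passing out of 100, item variance, and discriminations made.
for passing in (1, 2, 10, 25, 50, 75, 99):
    p = passing / 100
    print(passing, round(item_variance(p), 4), discriminations(passing, 100))

# The count of discriminations peaks when half the group passes.
best = max(range(101), key=lambda k: discriminations(k, 100))
```

Both columns rise to a maximum at 50 passing and fall away symmetrically on either side.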

Relation of Reliability to Item Intercorrelations. For the sake of internal
consistency, however, large item variances by themselves would mean nothing.
If equation (17.12) were limited to the item-variance terms alone, the test
would have zero internal consistency, zero reliability of the internal type.
This kind of reliability comes entirely from the covariance terms, and these
are composed of item intercorrelations as well as indices of dispersion. It
is only by virtue of their entering into the covariance terms that the item
variances contribute to internal consistency. The intercorrelations of the
items are the essential sources of this kind of reliability. The larger the
item intercorrelations, the greater is the internal consistency.

The Effect of Range of Item Difficulty upon Reliability. Reliability will be
higher when the items are nearly equal in difficulty. A wide range of
difficulty is not favorable to reliability. The reason is that the
appropriate index of item intercorrelation is the $\phi$ coefficient.
Operationally, with items scored as either 0 or +1, their distributions are
best conceived as point distributions. If two items differ much in
difficulty, the proportions passing the two differ and consequently $\phi$ is
restricted in size. Only when the two items are equal in difficulty can the
$\phi$ between them equal +1 as a maximum (see Chap. 13). Two items very far
apart in difficulty might correlate less than .20 even when each measures the
same thing and measures it well.
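
The ceiling on $\phi$ can be computed from the two pass proportions alone. The sketch below assumes the most favorable case, in which everyone who passes the harder item also passes the easier one; the function name is illustrative.

```python
from math import sqrt

def max_phi(p_i, p_j):
    """Largest phi attainable for two 0/1 items with pass proportions p_i and p_j.

    Both proportions must lie strictly between 0 and 1.
    """
    # Best case: the group passing the harder item lies wholly inside
    # the group passing the easier item.
    both = min(p_i, p_j)
    covariance = both - p_i * p_j
    return covariance / sqrt(p_i * (1 - p_i) * p_j * (1 - p_j))

print(round(max_phi(0.5, 0.5), 3))  # items of equal difficulty
print(round(max_phi(0.3, 0.7), 3))  # moderately different difficulty
print(round(max_phi(0.1, 0.9), 3))  # very different difficulty
```

With proportions of .10 and .90 the ceiling is about .11, which is in line with the remark that such items might correlate less than .20 however well each is measuring.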

Effect of Item Intercorrelations upon Total-score Distributions. There is an
interesting bearing of the internal consistency of a test upon the form of
distribution of total scores on that test. Imagine a test of 10 items each of
exactly median difficulty for the population ($p = q = .5$) and each
correlated +1.0 with every other item. A person who passes one item would
pass them all and a person who fails one item would fail them all. There
would be only two scores possible, 0 and 10. If 20 examinees took this test,
the chances are good that their frequency distribution would be like the
first diagram in Fig. 17.5. There would be perfect and maximal separation of
the two groups. The form of the distribution would be U-shaped. Examples of
U-shaped distributions can be found in Hull's book on hypnosis and
suggestibility, though they are not so extreme as the one in Fig. 17.5.¹ It
appears that some tests of suggestibility are such that if the examinee
responds in the suggestible manner in one trial he will respond similarly in
all trials.

Fig. 17.5. Illustration of the effects of item intercorrelation upon the form of frequency
distribution of total test scores.

¹ Hull, C. L. Hypnosis and Suggestibility. New York: Appleton-Century-Crofts, 1933.
P. 68.




If the item intercorrelations are not perfect but high, there will be some 
moderate scores but there will be a distinct tendency toward bimodality. 
The second distribution in Fig. 17.5 shows this type of test. With still fur- 
ther reduction in item intercorrelation, the distribution approaches rectangu- 
lar form, as in the third diagram in Fig. 17.5. With still further reduction in 
correlation, the distribution approaches normal form, but is somewhat 
platykurtic. A test of zero internal consistency, and with items of equal 
difficulty, would probably yield a normal distribution. It should not be
concluded, however, that a normal distribution indicates zero reliability. It
might do so, if all items were of equal difficulty at the level of p = .5. Rarely 
do tests conform to this condition. 
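
The two extreme cases for the 10-item test can be written out exactly. The sketch below assumes that "zero intercorrelation" means the items are fully independent, so that total scores follow a binomial distribution; that is a simplifying assumption, not a statement from the text.

```python
from math import comb

n, p = 10, 0.5

# Zero intercorrelation (assumed independent items): binomial total scores,
# which approximate the normal form.
independent = {
    score: comb(n, score) * p ** score * (1 - p) ** (n - score)
    for score in range(n + 1)
}

# Intercorrelations of +1.0: every examinee passes all items or none,
# so only the scores 0 and 10 occur (the U-shaped case).
perfect = {score: 0.0 for score in range(n + 1)}
perfect[0] = 1 - p
perfect[n] = p

def variance(dist):
    """Variance of a score distribution given as {score: probability}."""
    m = sum(score * prob for score, prob in dist.items())
    return sum(prob * (score - m) ** 2 for score, prob in dist.items())

for score in range(n + 1):
    print(score, round(independent[score], 3), perfect[score])

print(variance(independent), variance(perfect))
```

The variances illustrate equation (17.12): with independent items only the sum of the item variances remains, while with perfect intercorrelation the covariance terms supply most of the total variance.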

The Spearman-Brown Formula. The Spearman-Brown formula was designed to
estimate the reliability of a test $n$ times as long as the one for which we
know a self-correlation. Very often a split-half correlation is known for a
test, and the correlation of halves is an estimate of $r_{tt}$ for the half
test. The full-length test is not twice as reliable as the half test, but its
reliability is greater and can be estimated by the special Spearman-Brown
formula with $n = 2$. If we let $r_{hh}$ stand for the self-correlation of a
half test,

$$r_{tt} = \frac{2 r_{hh}}{1 + r_{hh}}$$    (17.13)

(Reliability of a total test estimated from reliability of one of its halves)


When this estimation formula is used, comparability of the halves must be 
assumed. Comparability is indicated to some degree by the fact of similar 
means, standard deviations, skewness of distributions, and, of course, similar 
content. If comparability is lacking, the reliability of the total test will be 
wrongly estimated. Since comparability is probably never perfect, an esti- 
mate by the use of the Spearman-Brown formula is probably conservative, 
because it tends to be an underestimate. 

Because the split-half method and also the alternate-forms method in the 
form of two separately timed halves of the same test are so common in prac- 
tice, the chart in Fig. 17.6 is supplied as an aid in the use of formula (17.13). 
Since the estimates are rough, in any case, the graphic solution will probably 
serve for most purposes. 

For the general case, in which $n$ could be any ratio of test length to that
for which $r_{11}$ is known,

$$r_{nn} = \frac{n r_{11}}{1 + (n - 1) r_{11}}$$    (17.14)

(Spearman-Brown formula for reliability)

where $r_{11}$ = reliability of the test of unit length.

As a matter of fact, the ratio $n$ in equation (17.14) could be fractional as
well as integral. If we knew the self-correlation for a test of 50 items, and
we wanted to know the probable reliability for a similar test of 75 items, $n$
would equal 1.5. If we knew the reliability of a test of 100 items and wanted
to know approximately the reliability for one of the same kind just half as
long, $n$ would be 0.5.
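
Formulas (17.13) and (17.14) reduce to one short function. This is a sketch only; the function name and the reliability of .80 used in the examples are hypothetical.

```python
def spearman_brown(r_unit, n):
    """Estimated reliability of a test n times as long, formula (17.14).

    r_unit is the reliability of the test of unit length; n may be fractional.
    """
    return n * r_unit / (1 + (n - 1) * r_unit)

# Special case n = 2, formula (17.13): a half-test correlation of .66.
print(round(spearman_brown(0.66, 2), 3))

# A 50-item test of reliability .80 lengthened to 75 items (n = 1.5).
print(round(spearman_brown(0.80, 1.5), 3))

# A 100-item test of reliability .80 cut to half its length (n = 0.5).
print(round(spearman_brown(0.80, 0.5), 3))
```

Applying the function with $n = 0.5$ and then with $n = 2$ returns the starting value, which is a convenient check that shortening and lengthening are treated consistently.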

As a matter of interesting information, the Spearman-Brown formula is
derived from equations for the correlation of sums. Equations somewhat
like (16.24) in the previous chapter have been developed for correlating one
composite with another composite, when correlations between parts in each
composite and between parts in one composite and parts in the other
composite are known. The equation simplifies if the parts have equal
variances and equal intercorrelations. The Spearman-Brown formula is such a
simplified equation. That is why we have to make the stated assumptions when
applying it.

Fig. 17.6. Reliability of a total test score as a function of known reliability of a half-test
score when the Spearman-Brown formula may be applied.

Reliability Estimated from Item-test Correlations. If we knew the size of
the item intercorrelations and if they were uniform in size, or nearly uniform,
we could apply the Spearman-Brown formula, letting $n$ equal the number of
items, to find $r_{tt}$.

We would probably not want to take the trouble to determine the inter- 
correlations among items, but their average can be estimated in a manner that 
is feasible. It has been shown that when item intercorrelations are of about 
the same magnitude and when items are of approximately equal difficulty, 
the average item intercorrelation is equal to the square of the average
correlation of items with total score. In a formula,

$$\bar{r}_{ij} = \bar{r}_{it}^{\,2}$$    (17.15)

(Relation of average item intercorrelation to average item-test correlation)

where the bars over the $r$'s indicate that they are averages; $r_{ij}$ = correlation
between item $I$ and item $J$, a $\phi$ coefficient; and $r_{it}$ = correlation between item
$I$ and total test score, a point-biserial $r$. The item-test correlations are
frequently known, as a by-product of item analysis. Their mean can be used in
the Spearman-Brown formula, which would then read

$$r_{tt} = \frac{n \bar{r}_{it}^{\,2}}{1 + (n - 1) \bar{r}_{it}^{\,2}}$$    (17.16)

(Estimate of $r_{tt}$ from average item-test correlations)

where $\bar{r}_{it}$ = mean of correlations of items with total test score.
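
Formulas (17.15) and (17.16) can be chained in a few lines. The sketch below uses a hypothetical test of 50 items with a mean item-test correlation of .30; the function name is illustrative.

```python
def reliability_from_item_test_r(mean_r_it, n_items):
    """Estimate r_tt from the mean item-test correlation."""
    mean_r_ij = mean_r_it ** 2   # formula (17.15): average item intercorrelation
    # Formula (17.16): Spearman-Brown with n equal to the number of items.
    return n_items * mean_r_ij / (1 + (n_items - 1) * mean_r_ij)

# Hypothetical values: 50 items, mean item-test correlation of .30.
print(round(reliability_from_item_test_r(0.30, 50), 3))
```

The estimate rests on the assumptions stated above: item intercorrelations of about the same magnitude and items of approximately equal difficulty.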

The Kuder-Richardson Estimates of Reliability. Like the methods just
described, the Kuder-Richardson formulas for estimating $r_{tt}$ depend upon item
statistics. They were developed because of dissatisfaction with split-half
methods. A test can be split into halves in a great many ways, and each
split might yield a somewhat different estimate of $r_{tt}$. The use of item
statistics gets away from such biases as may arise from arbitrary splitting
into halves.

The Kuder-Richardson methods make the same assumptions as for the use 
of the Spearman-Brown formula, for the principle is the same as that above, 
where we applied this formula to estimates of item intercorrelation. To 
repeat, those assumptions call for items of equal, or nearly equal, difficulty 
and intercorrelation. 

The most accurate of the practical Kuder-Richardson formulas is²

$$r_{tt} = \left(\frac{n}{n - 1}\right) \left(\frac{\sigma_t^2 - \sum pq}{\sigma_t^2}\right)$$    (17.17)

(General Kuder-Richardson formula for estimating reliability)

where $n$ = number of items in the test
      $p$ = proportion passing an item (or responding in some specified
          manner)
      $q$ = $1 - p$

¹ Richardson, M. W. Notes on the rationale of item analysis. Psychometrika, 1936, 1,
69-76.
² Richardson, M. W., and Kuder, G. F. The calculation of test reliability coefficients
based upon the method of rational equivalence. J. educ. Psychol., 1939, 30, 681-687.

It will be recognized, in comparing this formula with equation (17.12), that
the numerator term $(\sigma_t^2 - \sum pq)$ is the sum of the covariance terms in the
summation of item variances and covariances used to express the total test
variance. The expression $\sum pq$ is the sum of the variances of all items.
Deducting this quantity from the total test variance, we have left the sum of
the covariances. It is in these covariances that the source of true variance
lies. The ratio of this quantity $(\sigma_t^2 - \sum pq)$ to the total test variance thus
satisfies the basic definition of reliability given in the first part of this chapter.
The factor $n/(n - 1)$ is a minor correction that is needed to assure a
maximum possible $r_{tt}$ equal to 1.00.¹
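
Formula (17.17) can be sketched for a matrix of right-wrong item scores and, alternatively, for summary figures. The function names and the small response matrix are hypothetical; variances are taken with N in the denominator, which is an assumption about the convention in use.

```python
def kr20(responses):
    """Formula (17.17) for rows of 0/1 item scores (one row per examinee)."""
    n_items = len(responses[0])
    n_people = len(responses)
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    sum_pq = 0.0
    for item in zip(*responses):
        p = sum(item) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total

def kr20_from_summary(n_items, var_total, sum_pq):
    """Same formula when only the total variance and the sum of pq are known."""
    return (n_items / (n_items - 1)) * (var_total - sum_pq) / var_total

# Perfectly consistent items: the n/(n - 1) factor lifts the estimate to 1.00.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
print(kr20(consistent))

# Summary figures of the kind given in Exercise 4: 55 items, SD 7.5, sum of pq 9.8327.
print(round(kr20_from_summary(55, 7.5 ** 2, 9.8327), 2))
```

Without the factor $n/(n - 1)$ the first example would give only .67, which shows why the correction is needed for a maximum of 1.00.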

A Shorter Approximation to the Kuder-Richardson Reliability. If we are 
justified in assuming that all items in the test have approximately the same 
degree of difficulty, we may use a formula that is much less demanding of 
information. It reads 


$$r_{tt} = \left(\frac{n}{n - 1}\right) \left(\frac{\sigma_t^2 - n\bar{p}\bar{q}}{\sigma_t^2}\right)$$    (17.18)

(An approximation formula for the Kuder-Richardson reliability)

where $\bar{p}$ and $\bar{q}$ = average proportions of passing and failing examinees for
each item, respectively.

The values of $\bar{p}$ and $\bar{q}$ can be obtained without counting successes and
failures for every item, for the average $p$ is equal to the mean of the total
scores divided by $n$, and the average $q$ is $1 - \bar{p}$. From these facts, the
formula can be simplified to

$$r_{tt} = \frac{n\sigma_t^2 - \bar{R}\bar{W}}{(n - 1)\sigma_t^2}$$    [Alternate to formula (17.18)]    (17.19)

where $\bar{R}$ = average number of right responses and $\bar{W}$ = average number of
wrong responses (or $n - \bar{R}$). $\bar{R}$ is, of course, the mean of the total scores.
In more familiar symbols,

$$r_{tt} = \frac{n\sigma_t^2 - M(n - M)}{(n - 1)\sigma_t^2}$$    [Substitute for formula (17.19)]    (17.20)

It should be said that all the Kuder-Richardson formulas, indeed all the
internal-consistency formulas that depend upon a single administration of a
test, probably underestimate the reliability of a test, formula (17.20) most of
all. Of all these formulas, (17.17) should usually come closest to the correct
value of $r_{tt}$ under the conditions of testing prevailing. Although some of
these formulas get away from appearance of item statistics in them, it should
not be forgotten what assumptions are implied. They do not apply to speed
tests, including, in fact, most time-limit tests.

Several other variations of the formulas have been proposed to meet special
requirements. Hoyt suggests a formula convenient for use with raw data, a
formula not requiring the computation of a mean or a variance.² For the
test in which items are weighted differently, Dressel has provided a useful
variation.¹ Dressel also provides formulas to apply when scoring formulas
are used, weighting wrong responses and omissions differently.

¹ Brogden has shown empirically that variation in difficulty of items over very wide
ranges does not lead to appreciable bias in the estimation of $r_{tt}$ by formula (17.17).
Brogden, H. E. The effect of bias due to difficulty factors . . . on the accuracy of
estimation of reliability. Educ. psychol. Measmt., 1946, 6, 517-520.

² Hoyt, C. J. Note on a simplified method of computing test reliability. Educ. psychol.
Measmt., 1941, 1, 93-95.
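
The shorter approximation, formulas (17.18) to (17.20), needs only the number of items, the mean, and the variance of total scores. The sketch below uses hypothetical figures (50 items, mean 30, standard deviation 8), and the function names are illustrative.

```python
def kr21(n_items, mean_total, var_total):
    """Formula (17.20): reliability from n, the mean M, and the total variance."""
    numerator = n_items * var_total - mean_total * (n_items - mean_total)
    return numerator / ((n_items - 1) * var_total)

def kr21_from_proportions(n_items, mean_total, var_total):
    """Formula (17.18), written with the average proportions p-bar and q-bar."""
    p_bar = mean_total / n_items
    q_bar = 1 - p_bar
    return (n_items / (n_items - 1)) * (var_total - n_items * p_bar * q_bar) / var_total

# Hypothetical test: 50 items, mean 30, standard deviation 8.
print(round(kr21(50, 30, 8 ** 2), 3))
print(round(kr21_from_proportions(50, 30, 8 ** 2), 3))
```

The two forms give the same value, as the algebra requires. Because the sum of the item variances can never exceed $n\bar{p}\bar{q}$, this approximation can never exceed the value from formula (17.17) for the same data, which agrees with the remark that it underestimates most of all.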

The Rulon Method of Estimating $r_{tt}$. It was mentioned earlier that Rulon
had developed a method of computing the standard error of measurement,
$\sigma_{t\infty}$, from differences in scores on two halves of a test. Because of the
relations between $r_{tt}$ and $\sigma_{t\infty}$, the same approach leads to another kind of
estimate of reliability. It is usually applied to halves of the test in a single
administration and hence comes under the category of an internal-consistency
reliability, but it could also be applied to alternate forms.

Because $\sigma_{t\infty}^2$ measures the amount of error variance, an estimate of $r_{tt}$ is
given by the formula

$$r_{tt} = 1 - \frac{\sigma_{t\infty}^2}{\sigma_t^2}$$    (17.21)

(Reliability by the Rulon formula)

where $\sigma_{t\infty}^2 = \sum d^2 / N$, as in formula (17.9).

Rulon's formula² is especially applicable when an IBM test-scoring machine
is available, for this instrument can be so adjusted as to yield a difference
between odds and evens for each examinee.

The Rulon method is subject to the same restrictions as for any split-half 
procedure. It should be noted that the formula gives the reliability of the total 
test scores and not of the halves, and so the Spearman-Brown formula should 
not be applied. If the Rulon difference formula should be applied to differ- 
ences between scores on two forms, the reliability coefficient thus estimated 
applies to a test of twice the length of either form. A correction to the 
reliability wanted for each form can be made by substituting .5 for $n$ in
formula (17.14).
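
The Rulon estimate can be sketched from summary figures. The function name is illustrative; the numbers are those of Exercise 3 at the end of this chapter (sum of squared half-score differences 285, 50 examinees, standard deviation of total scores 8.5).

```python
from math import sqrt

def rulon_from_summary(sum_d_squared, n_examinees, sd_total):
    """Formula (17.21): returns (reliability, standard error of measurement).

    sum_d_squared is the sum of squared differences between the two half scores.
    """
    error_variance = sum_d_squared / n_examinees
    reliability = 1 - error_variance / sd_total ** 2
    return reliability, sqrt(error_variance)

r_tt, sem = rulon_from_summary(285, 50, 8.5)
print(round(r_tt, 2), round(sem, 2))
```

The result applies to the total test, so no Spearman-Brown correction is applied to it.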

A Summary of Internal-consistency Reliability. Internal-consistency
reliability is most appropriately applied to homogeneous tests, i.e., tests
composed of equivalent units, equivalent in several respects. The parts
(usually items) all measure the same trait, or traits, to about the same degree. 
The total variance of a test can be conceived as a sum of the variances and 
covariances of its parts. The true variance of a test is contributed by its 
covariances to which both the item variance and item intercorrelations are 
important contributors. Internal-consistency reliability is greatest when: 

1. The item intercorrelations are greatest. 

2. The variance of items is greatest. This is when the proportion passing 
an item is .50. 

3. The items are of equal difficulty. Then the item intercorrelations are 
at a maximum. 


¹ See Dressel, P. L. Some remarks on the Kuder-Richardson reliability coefficient.
Psychometrika, 1940, 5, 305-310.
² Rulon, op. cit.

In estimating an internal-consistency $r_{tt}$, most methods rest upon the
assumptions of equivalence of parts in the sense of equality of difficulty and
equality of intercorrelation. If these conditions are not satisfied, estimates
of $r_{tt}$ may still be made, but the farther the departure of the situation from
these specifications, the more is $r_{tt}$ likely to be in error.¹


SOME SPECIAL PROBLEMS IN RELIABILITY


Like all coefficients of correlation, $r_{tt}$, however estimated, must be
interpreted in a relativistic manner. Its size depends upon many conditions
under which it is obtained experimentally. Some of the more important
conditions and considerations will be mentioned in what follows.

Fig. 17.7. Illustration showing an extreme instance of curtailment of range. The
correlation for the cases within the smaller rectangle will be much smaller than the
correlation of all cases within the larger rectangle.

Reliability in Different Ranges of Measurement. Like intercorrelations
of different variables, self-correlations are affected by the range of ability or
of a trait present in the population sampled. The narrower the range, the
smaller $r_{tt}$ tends to be. This can be seen mathematically if one examines
formula (17.21), where $r_{tt}$ is given as equal to $1 - \sigma_{t\infty}^2 / \sigma_t^2$. If the standard
error of measurement remains constant regardless of the range of ability in
the sample, we see that if the range, as measured by $\sigma_t$, decreases, the
denominator $\sigma_t^2$ decreases, the ratio $\sigma_{t\infty}^2 / \sigma_t^2$ increases, and $r_{tt}$ decreases.
This is why some test users prefer to know $\sigma_{t\infty}$ rather than $r_{tt}$ concerning a
test, since it is probably more stable from population to population. It is
another good reason we should not speak of the reliability of a test. Figure
17.7 illustrates how in a restricted sample (small square) the same scatter of
points gives a relatively wider spread and hence a lower correlation.
Restriction is not ordinarily as clear cut as this in practice, but the
principle is the same.

¹ For a much more complete discussion of reliability and how to estimate it, and for
descriptions of item-analysis methods, see Guilford, J. P. Psychometric Methods, 2d ed.
New York: McGraw-Hill, 1954.

If we wish to estimate the reliability coefficient in one range from the known 
reliability in another range, the following formula may be used. It assumes 
equal standard error of measurement in both ranges. 


$$r_{nn} = 1 - \frac{\sigma_o^2 (1 - r_{oo})}{\sigma_n^2}$$    (17.22)

(Estimation of $r_{tt}$ in a population of one dispersion from that in another similar
population of different dispersion)

where $\sigma_o$ = standard deviation of the distribution for which the reliability
           coefficient is known
      $\sigma_n$ = standard deviation of the distribution for which the reliability is
           not known
      $r_{oo}$ and $r_{nn}$ = reliabilities in the two respective distributions

If we know that a more limited group has a standard deviation of 8.0 and a
reliability coefficient of .85 for a test, what will be the reliability coefficient in
a more variable group whose $\sigma$ is 10.0? Applying formula (17.22),

$$r_{nn} = 1 - \frac{8^2 (1 - .85)}{10^2} = .904$$

Reliability and the Length of Test. It was indicated in connection with
the split-half method that the whole test is more reliable than either half and 
that in general terms there is an increase in reliability going with increased 
length of test. This is true if the additional items added to a test are homogene- 
ous with the ones to which they are added. By homogeneous we mean that they 
have about the same intercorrelation with the items already in the test as 
those items have among themselves and possess about the same level of diffi- 
culty. If a test is lengthened to n times its present length under these condi- 
tions, we have a right to expect a change in reliability in accordance with the 
Spearman-Brown equation, which was given previously [formula (17.14)].

Lengthening a Test to Attain a Certain Desired Reliability. We can use the
Spearman-Brown formula in reverse. If we know the reliability of a short
test is .75, we can ask how long the test would have to be to attain a reliability
of .90. If we solve the equation of the Spearman-Brown formula to find $n$, it
becomes

$$n = \frac{r_{nn} (1 - r_{11})}{r_{11} (1 - r_{nn})}$$    (17.23)

(Estimation of length of test required for a given reliability)

Substituting the known values in this equation, we have

$$n = \frac{.90 (1 - .75)}{.75 (1 - .90)} = 3.0$$

The test with $r_{11} = .75$ would have to be three times as long to attain a
reliability of .90.

Any other level of reliability, larger or smaller, in which we are interested
can serve as $r_{nn}$, and the necessary $n$ ratio can then be computed. Experience
will show that some tests of low reliability cannot reach some desired high 
reliability without being made indefinitely long, or so long as to be impractical. 
Others will exhibit promising improvements in reliability with a moderate 
amount of extension. The formula is useful in this respect, that it helps 
decide upon rejection or extension of tests, or it is useful in cases in which a 
test is already too long for comfort and we need to decide whether shortening 
it would sacrifice too much in reliability. 
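
Formula (17.23) can be sketched as a single function. The function name is illustrative, and the second and third examples use hypothetical reliabilities chosen to show an impractical case and a shortening case.

```python
def required_length_ratio(r_current, r_desired):
    """Formula (17.23): how many times as long the test must be made."""
    return r_desired * (1 - r_current) / (r_current * (1 - r_desired))

# The case worked in the text: from .75 to .90.
print(round(required_length_ratio(0.75, 0.90), 2))

# A test of low reliability aimed at a high target needs impractical length.
print(round(required_length_ratio(0.30, 0.95), 2))

# A ratio below 1 shows how far a test could be shortened for a lower target.
print(round(required_length_ratio(0.90, 0.80), 2))
```

A ratio well above what is practical suggests rejecting or revising the test rather than extending it.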

Reliability of Ratings and Other Judgments. Many of the statistics 
described in connection with test scores also apply fairly well to human judg- 
ments of various kinds. The judgments may be in the form of rank order, 
rating-scale evaluations, pair-comparisons scaling, judgments in equal- 
appearing intervals, and the like. We can correlate the same observer's 
judgments obtained at two different times, or we can assume that similar 
judges are interchangeable and intercorrelate their evaluations (see discussion 
of intraclass correlation in Chap. 12). We can pool judgments for two com- 
parable groups of observers and correlate them so long as they apply to the 
same objects or persons. 

Experience has shown that with due cautions these applications may be 
made with meaningful results. Every coefficient must, as usual, be
interpreted in the light of the manner in which it was obtained. Even the
Spearman-Brown formula has been shown to apply, as, for example, in the 
pooling of judgments from two observers, which yields increased reliability 
in a manner found for the doubling of a test in length. The comparability 
of judges must be true here just as the comparability of items must be true in 
applying this formula to the change in length of test. 


Exercises 


1. The following reliability coefficients were presented for a certain test:

    Split half ........ .96        Retest after 1 month ...
    Alternate form .... .94        Retest after 2 years ...

Are these coefficients reasonable? Explain.

2. In six tests, the following correlations were found between halves composed of
comparable items: .43, .55, .66, .74, .86, .94. Determine the reliability
coefficient for the full-length tests.

3. In a certain test, the sum of the squared differences between scores on two comparable
halves equaled 285. $N$ = 50 and $\sigma_t$ = 8.5. Find the coefficient of reliability for the total
scores and the standard error of measurement.

4. In a test of 55 items, the SD of the total scores was 7.5. The sum of the variances of
the items was 9.8327. Estimate the reliability of the scores.

5. Another test of 150 items has a SD of 24.4 and a mean of 94.2. Estimate the
reliability of the scores, assuming that the items are approximately equal in difficulty and
intercorrelation.

6. In four tests, the reliability coefficients were .65, .76, .87, and .94. Determine $r_{t\infty}$
and $\sigma_{t\infty}$ in each case, assuming a SD of 10.0.

7. Complete the following table, determining all the needed values of $r_{nn}$:


8. For the coefficients in the completed table for Exercise 7, plot on graph paper the
increase in $r_{nn}$ (on the ordinate) as $n$ increases (on the abscissa) for each value of $r$. Draw
some general conclusions.

9. Complete the following table, computing the necessary $n$'s:


2. $r_{tt}$ = .60; .71; .80; .85; .92; .97.
3. $r_{tt}$ = .92; $\sigma_{t\infty}$ = 2.39.
4. $r_{tt}$ = .84 [by formula (17.17)].
5. $r_{tt}$ = .95 [by formula (17.20)].
6. $r_{t\infty}$: .81, .87, .93, .97; $\sigma_{t\infty}$: 5.9, 4.9, 3.6, 2.4.
7. N e 4, 63, B1; n = .70; 78, 96; ri = ‚90; 97, .98, .99. 
9. т = 30, я: 7.00, 44.33; when u = .50, s: 1. 5.67; 2 
уи 4 ; ia = (80, я: 1.86, 5.67; when ru = .70, 5:128, 
10, faa: .80; .91. 


CHAPTER 18 


VALIDITY OF MEASUREMENTS 


While most of the comments in this chapter will be about the validity of 
tests, the problem of validity arises in all kinds of measurements. Most of 
what is said about validity of tests applies to other methods of evaluation and 
measurement. 

PROBLEMS OF VALIDITY


It is usually easy enough to apply a metric instrument and to obtain some 
numerical data. In the physical sciences the meaning of numbers that are 
used to describe phenomena is usually well established. The values stand 
for degrees of electrical resistance, pressure of a gas, or mass of a particle. In
the social sciences, however, the connection between a number and the thing, 
or things, for which it stands is not nearly so obvious. 

Nor is the situation helped very much or the problem solved by conjuring
a name for a supposed variable that the numbers stand for. There is said to
be a country in which, until recent years, at least, it was regarded as bad
taste for anyone to question whether a certain test measures trait X if the
distinguished psychologist who invented the test says it measures trait X.
There are other, supposedly more enlightened countries, unfortunately, in
which the same attitude exists to some degree in some quarters. The problem
would not be so serious if conclusion after conclusion about supposed
underlying properties were not built upon the evidence of measurements
which may not, after all, have much to do with those properties. There may
even be considerable question also about the existence of the properties.

Types of Validity. The question of validity, of a test or of any metric 
instrument, has many facets, and it requires clear thinking not to be confused 
by them. In crudest terms, we say that a test is valid when it measures what 
it is presumed to measure. This is but one step better than the definition 
that states that a test is valid if it measures the truth. 

In this chapter it will be held that validity is a highly relative concept. If
the question is asked about any particular test, “Is this test valid?” the
answer should be in the form of another question, “Is it valid for what?”
Furthermore, just as we found in the preceding chapter that we cannot,
strictly speaking, state any figure as representing the reliability of a test, so
we cannot give a single number to indicate the validity of a test.

There was a time, unfortunately still not entirely past, when each test was
supposed to measure some underlying variable that went by a label. It was 
a test of intelligence, of introversion, or of neurotic tendency. Those con- 
cepts, because of the fixed labels, were supposed to be qualitatively stable, 
known, and defined attributes. In order to be valid, tests going by those 
names were expected to correlate highly with older, generally accepted criteria 
of those supposed entities. For example, new tests were “validated” by 
demonstrating a strong correlation with the Stanford Revision of the Binet 
test or with Laird’s test C2 or with Woodworth’s inventory. 

Factorial Validity. Now that these popular areas of personality have been 
shown to lack real unity and unanimity of reference,¹ we are properly more
wary of attaching such labels to tests. If we regard intelligence as having 
been broken down into a collection of functional unities, called primary 
abilities for convenience, we find that the question of what is a valid intelli- 
gence test becomes meaningless. The primary abilities, on the other hand, 
have been arrived at by means of well-defined steps and can be verified by 
one who repeats those steps. If one acquiesces in the procedures by which 
those functional unities are discovered, he has no choice, if he still is con- 
cerned about the validity of tests, but to ask whether test A is a valid one for 
measuring this primary ability or that one. 

The validity of a test as a measure of one of these factors is indicated by its 
correlation with the factor, which is its factor loading.² It is recognized by
those who adopt the factor-analysis approach that scarcely any test is an
unadulterated measure of any primary ability or trait. Not only is it diluted
by errors of measurement, as we saw in the discussion of reliability, but it is
also adulterated with variances in other primary abilities or traits. This
situation is overcome to some extent by a careful combining of tests, an
exacting procedure that we cannot go into here. It is the author's belief that
the best answer to the question, “What does this test measure?” is in the
form of a list of the primary factors with which it correlates and their
proportions of variance in the test.³ This kind of validity may be called factorial
validity. This idea will be explained more fully and it will be shown that it is 
basic to the understanding of other kinds of validity and of many phenomena 
of correlation in general. 

¹ See in particular Thurstone, L. L. Primary mental abilities. Psychometr. Monogr.,
1939, 1; Guilford, J. P., and Guilford, R. B. Personality factors D, R, T, and A. J.
abnorm. soc. Psychol., 1939, 34, 21-36; and Mosier, C. I. A factor analysis of certain
neurotic tendencies. Psychometrika, 1937, 2, 263-286.

² For a brief discussion of factor theory and methods, see Guilford, J. P. Psychometric
Methods, 2d ed. New York: McGraw-Hill, 1954. Chap. 16.

³ Guilford, J. P. Factor analysis in a test-development program. Psychol. Rev., 1948,

Practical Validity. The vocational counselor and the vocational selector
face a different kind of problem when they inquire about validity of tests.
They are concerned about predicting outcomes in specified tasks and
situations: clerical ability, scholastic ability, salesmanship, and the like. A test
is a valid one for clerical aptitude if its scores correlate highly with later 
clerical proficiency. Another test is a valid one for aptitude in selling, 
because it correlates highly with later proficiency in selling. From this point 
of view, any test is valid for any sphere of behavior if it enables us to predict 
within that sphere, regardless of the name of the test or the supposed funda- 
mental abilities that it measures. A test designed to predict the success of 
student aviators may prove also to be a valid test of scholastic aptitude in 
engineering or of aptitude for a military career in general. From the practical 
standpoint, the validity of a test is its forecasting efficiency in predicting any 
measurable aspect of daily living. 

Criteria for Validity. One of the most difficult of all aspects of the validity 
problem is that of obtaining adequate criteria of what we are measuring. 
The factor-analysis approach has a fairly good solution when it is primary 
traits or abilities that we wish to measure. If two or more tests or items are
combined to predict the factor, the validity coefficient is the multiple
correlation between the tests and the factor. But practical criteria are most in
demand and are most difficult to obtain and to measure adequately. An
example of this is the criterion of scholastic achievement. 

It has often been assumed that scholastic achievement, like intelligence, 
is a unitary attribute of each individual. But this is far from the truth. 
Although there is generally a positive correlation between achievement in 
different school subjects, there is sufficient disagreement to permit an
individual to receive marks all the way from A to F in different subjects. It is
the best procedure, therefore, to examine the validity of each test used for
guidance purposes in connection with every school subject taken by itself. Where
a certain test of ability may possess only a moderate or low correlation with
averages of school marks, it may correlate very highly with marks in specific courses.
The writer has data showing correlations all the way from .37 to .74 between 
the Ohio State Psychological Examination, Form 20, and marks in freshman 
courses at a certain university. 

The point is that success in any sphere of life is ordinarily highly complex
and is determined by many psychological factors in the individuals
competing, rather than one or a few. If we measure success in a complex activity
by singling out as criteria one or more of its aspects and measuring them, we 
are checking upon the validity of the test or tests for predicting those chosen 
aspects. We should not identify those few aspects with the entire activity. 
We should, of course, attempt to single out the most significant aspects as 
criteria. Too often some inconsequential aspects are chosen because of their
ready observability and measurability. 

Having chosen the measurable variables of success in the area predicted, 
we have the problems of securing dependable measurements and perhaps of
combining and weighting them in the wisest manner. With reference to
measures of achievement, again, it should be emphasized that school marks 
as ordinarily assigned by teachers are rather poor metric material. Varia- 
tions in meaning and standards from teacher to teacher and from course to 
course are notorious. Most marks are neither very reliable nor very valid 
indicators of achievement. The best measures of achievement in most 
courses are those obtained directly from good, comprehensive examinations 
of the objectively scored type. Marks otherwise obtained often have relia- 
bilities in the range from .60 to .80, and their validities are unknown. When 
we attempt to find the predictive value of a psychological test, therefore, shall 
we reject tests that fail to correlate highly with such fallible criteria? We 
can allow for the unreliability of criteria statistically when we know a
coefficient of reliability for them. We cannot so easily know or allow for lack of
validity of criteria, though we can make allowances, knowing the kind of 
criteria we have. 


A BRIEF INTRODUCTION TO FACTOR THEORY

Because so many of the facts of validity are explainable on the basis of 
factor theory, it is desirable for us to examine the basic features of factor 
theory in order to gain a better grasp of the problems and methods involved. 
There is not space here to describe the procedures for making a factor analysis
of tests. These statistical procedures when described sufficiently for general 
use would take up a small volume in themselves. 

Basic Assumptions in Factor Theory. It is best to begin with basic 
theorems, two of which will give us the foundation we need for the logic of 
validity. 

Theorem I: The total variance of a test may be regarded as the sum of three 
kinds of component variances: (1) that contributed by one or more common 
factors, common because they appear in more than one test; (2) that unique 
to the test itself and possibly to its equivalent forms; and (3) error variance. 
We are now ready to break up what was called true variance in the preceding
chapter into component variances. Both the common-factor variances and
the specific variance in a test contribute to its internal-consistency reliability
and to its equivalent-forms reliability. It is not necessary to assume that the
common-factor and specific-factor variances are all independent or uncorre- 
lated. To do so relieves us of having to deal with covariance terms and thus 
simplifies the picture. What follows would be just as true, in general, if we 
did not add this specification to the assumption.²

1 The most profound source of information on factor analysis is Thurstone, L. L.
Multiple Factor Analysis. Chicago: University of Chicago Press, 1947. For other presentations,
see Cattell, R. B. Factor Analysis. New York: Harper, 1952; and Fruchter, B.
Introduction to Factor Analysis. New York: Van Nostrand, 1954.

2 This theorem and the second follow from the basic postulate that an obtained test
score is a simple summation of components from the sources indicated in theorem I.

Theorem I may be stated in the form of an equation:

$\sigma_t^2 = \sigma_a^2 + \sigma_b^2 + \cdots + \sigma_n^2 + \sigma_s^2 + \sigma_e^2$        (18.1)

(Sum of independent variances in scores on a test)

where $\sigma_t^2$ = total variance of a test
$\sigma_a^2, \sigma_b^2, \ldots, \sigma_n^2$ = variances in factors A, B, . . . , N, respectively
$\sigma_s^2$ = variance specific to this test
$\sigma_e^2$ = error variance
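If we now divide equation (18.1) through by $\sigma_t^2$, we have

$\dfrac{\sigma_a^2}{\sigma_t^2} + \dfrac{\sigma_b^2}{\sigma_t^2} + \cdots + \dfrac{\sigma_n^2}{\sigma_t^2} + \dfrac{\sigma_s^2}{\sigma_t^2} + \dfrac{\sigma_e^2}{\sigma_t^2} = 1.00$        (18.2)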


Substituting new symbols for these fractions, which are proportions, we have

$1.00 = a_x^2 + b_x^2 + \cdots + n_x^2 + s_x^2 + e_x^2$        (18.3)

(Proportions of factor variances in a test)

where $a_x^2, b_x^2, \ldots, n_x^2$ = proportions of total variance contributed to test
X by factors A, B, . . . , N, respectively
$s_x^2$ = proportion of specific variance in test X
$e_x^2$ = proportion of error variance in test X
In the same notation, the reliability of test X can be written as 
$r_{tt} = 1 - e_x^2 = a_x^2 + b_x^2 + \cdots + n_x^2 + s_x^2$        (18.4)
(Reliability as a sum of proportions of nonerror variance) 
This equation will be useful in discussions of the relation of validity to relia- 
bility later on. 

Communality. A new concept that should be pointed out here, although 
we shall not have occasion to do much with it in a practical way in this chap- 
ter, is known as the communality of a test. The communality of a test is the 
sum of the proportions of common-factor variances. In equation form, 


$h_x^2 = a_x^2 + b_x^2 + \cdots + n_x^2$        (Communality of a test)        (18.5)


The communality of a test contains all the nonerror variance except the 
specific variance. Communality is what gives any test the chances of corre- 
lating with other tests and with practical criteria. If there were no com- 
munality in a test it could be quite reliable and still not correlate with any- 
thing else. On the other hand, a test could have relatively low reliability, 
and yet if all its nonerror variance were in common with variance in other 
variables, its correlations with other things could be rather substantial; hence 
its validity could be good. 

A Numerical Example of Component Variances. As an example, let us 
consider three tests and a practical criterion. Five common factors are 
represented in these four variables. In Table 18.1 we have listed the
proportions of common-factor, specific, and error variance for each variable.
Test 1 has 36 per cent of its variance accounted for by factor A, and 36 per
cent by factor C. The sum of these two components equals 72 per cent, which
represents the communality of this test. Add the 10 per cent specific variance,
and we have 82 per cent, which represents the test's true variance and a
reliability of .82. The remaining 18 per cent is error variance. The other
tests and criterion J can be interpreted in a similar manner. Figure 18.1
shows the component variances for these same four variables, each as a
segment of a bar diagram.

TABLE 18.1. PROPORTIONS OF COMMON-FACTOR, SPECIFIC, AND ERROR VARIANCE IN
THREE TESTS AND A PRACTICAL CRITERION OF PROFICIENCY
[Table body not preserved in this copy.]

FIG. 18.1. Proportions of common-factor, specific, and error variance in three hypothetical
tests and a criterion. [Bar diagram; horizontal axis: proportion of variance, 0 to 1.0.]
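
As a quick check on the arithmetic for Test 1, the following minimal Python sketch adds up the component proportions quoted in the text (.36 for factor A, .36 for factor C, .10 specific, .18 error). The dictionary layout and the function names `communality` and `reliability` are illustrative assumptions, not part of the original presentation.

```python
# Proportions of variance for Test 1, as quoted in the text.
test1 = {"A": 0.36, "C": 0.36, "specific": 0.10, "error": 0.18}


def communality(props):
    """Sum of the common-factor proportions (all keys except specific and error)."""
    return sum(v for k, v in props.items() if k not in ("specific", "error"))


def reliability(props):
    """Nonerror proportion: communality plus specific variance, as in equation (18.4)."""
    return communality(props) + props["specific"]


h2 = communality(test1)
r_tt = reliability(test1)

print("communality:", round(h2, 2))
print("reliability:", round(r_tt, 2))
print("error:      ", round(1.0 - r_tt, 2))
```

The three printed values should agree with the 72, 82, and 18 per cent figures given for Test 1.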

Factor Loadings. The proportion of a total variance contributed by one
component may be regarded as a coefficient of determination of the total by 
the part. The square root of each proportion of variance contributed by a 
common factor may therefore be regarded as the correlation between the
total variable and the factor. These square roots are correlation coefficients
and are known as factor loadings or factor saturations. For the three tests and
criterion J, the common-factor loadings are given in Table 18.2. Test 2
correlates .40 with factor A, .35 with factor C, and .80 with factor F. Factor
F has no correlations with other variables in this list, but in order to be 
regarded as a common factor it must have some correlation with other varia- 
bles not in this list. 

The square roots of specific variance are not listed because it is not certain 
what the specific variances represent. A certain specific variance may indeed 
be unique to its own test, but it may be a composite of some kind, in which 
case each component of the specific variance would have its own correlation 
with the total. On the other hand, some specific variances might turn out on 
later analyses to be one or more unrecognized common-factor variances. 
Certain tests have been known to lack any specific variance at all, the entire 
true variance being composed of common-factor components and the com- 
munality equaling the reliability of the test. 

TABLE 18.2. FACTOR LOADINGS (CORRELATIONS OF COMMON FACTORS WITH
EXPERIMENTAL VARIABLES) FOR THE THREE TESTS AND A CRITERION

                       Common factors
Variables         A      B      C      D      F
Test 1           .60    .00    .60    .00    .00
Test 2           .40    .00    .35    .00    .80
Test 3           .00    .70    .00    .50    .00
Criterion J      .40    .30    .40    .50    .00


Theorem II: The second major theorem of factor analysis is that the corre- 
lation between two experimental variables (such as tests and criteria) is 
equal to the sum of the cross products of their common-factor loadings. In 
equation form,

$r_{jx} = a_j a_x + b_j b_x + \cdots + n_j n_x$        (18.6)

(A correlation as a sum of factor loading products)

where $a_j$ and $a_x$ = loadings of factor A in criterion J and test X, $b_j$ and
$b_x$ = loadings of factor B in criterion J and test X, etc.

How Factor Theory Explains Practical Validity. Applied to the loadings
given in Table 18.2, the correlation between tests 1 and 2 would be

$r_{12} = (.6)(.4) + (.0)(.0) + (.6)(.35) + (.0)(.0) + (.0)(.8) = .45$

The correlation between test 1 and criterion J (its validity for predicting
criterion J) would be

$r_{j1} = (.4)(.6) + (.3)(.0) + (.4)(.6) + (.5)(.0) + (.0)(.0) = .48$

TABLE 18.3. INTERCORRELATIONS OF TESTS AND CRITERION J DERIVED FROM
THEIR COMMON-FACTOR LOADINGS

                        Tests
Variables         1      2      3      Criterion J
Test 1            -     .45    .00         .48
Test 2           .45     -     .00         .30
Test 3           .00    .00     -          .46

The other intercorrelations and validity coefficients found in similar manner 
are listed in Table 18.3. In experimental practice we do not know the factor 
loadings first and derive from them the intercorrelations; we know the inter- 
correlations and by factor analysis arrive at the factor loadings. We have 
assumed that the factor loadings are known here for the sake of illustration.

FIG. 18.2. Segments of three intercorrelations of tests and a criterion that are contributed
by different common factors.
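
The derivation of intercorrelations from loadings can be sketched in a few lines of Python. This is only an illustration of Theorem II: the loadings are those of the three tests as listed in Table 18.2, the criterion J row is taken from the worked product for test 1 and J, and the names `LOADINGS` and `correlation` are assumptions introduced here.

```python
# Common-factor loadings in the order A, B, C, D, F.
LOADINGS = {
    "test1": (0.60, 0.00, 0.60, 0.00, 0.00),
    "test2": (0.40, 0.00, 0.35, 0.00, 0.80),
    "test3": (0.00, 0.70, 0.00, 0.50, 0.00),
    "J": (0.40, 0.30, 0.40, 0.50, 0.00),
}


def correlation(x, y, loadings=LOADINGS):
    """Theorem II: sum of the cross products of the common-factor loadings."""
    return sum(a * b for a, b in zip(loadings[x], loadings[y]))


pairs = [("test1", "test2"), ("test1", "test3"), ("test2", "test3"),
         ("J", "test1"), ("J", "test2"), ("J", "test3")]
for x, y in pairs:
    print(f"r({x}, {y}) = {correlation(x, y):.2f}")
```

Tests that share no common factor (test 3 with tests 1 and 2) come out with a correlation of zero, which is why they contribute independent information about the criterion.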


Examination of the three validity coefficients in Table 18.3 shows that they 
are .48, .30, and .46, for tests 1, 2, and 3, respectively. The three validity 
coefficients are represented graphically in Fig. 18.2. The reasons for the 
validity of tests 1 and 2 are the same; their common ground with the criterion 
is in factors A and C. The reason test 3 is valid, however, is totally different 
from this. Test 3 is valid because of having in common with the criterion 
factors B and D. Test 2 has the lowest validity for predicting criterion J, 
but its unusually large loading in factor F offers strong possibilities for its 
validity in predicting some other criterion that has a substantial loading in 
factor F.

How Factor Theory Explains Multiple-correlation Principles. The multiple
correlations of some of these tests and criterion J can be nicely explained
by the various factor loadings. The multiple correlation $R_{j.12}$ is .49, which
is only .01 higher than the correlation $r_{j1}$. Adding test 2 in a battery to test 1
to predict J is of little value because both bring to the composite a coverage
of the same common factors in J. The multiple $R_{j.13}$, however, is equal to
.66. Adding test 3 to test 1 to make a joint prediction of J is very effective
because the two tests cover totally different components in J. The multiple
$R_{j.23}$ is less than $R_{j.13}$, being .55. The reason for this is that test 2 does not
cover factors A and C nearly so well as does test 1.
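
The three multiple correlations can be reproduced with the standard two-predictor formula, $R^2 = (r_{j1}^2 + r_{j2}^2 - 2 r_{j1} r_{j2} r_{12}) / (1 - r_{12}^2)$. The sketch below assumes that formula and uses the validities .48, .30, and .46 together with intercorrelations of .45 (tests 1 and 2) and .00 (test 3 with either other test), the latter following from the absence of shared factors. The function name `multiple_r` is hypothetical.

```python
import math


def multiple_r(r_j1, r_j2, r_12):
    """Multiple correlation of a criterion with two predictors.

    r_j1, r_j2: validities of the two predictors.
    r_12: correlation between the two predictors.
    """
    r_squared = (r_j1 ** 2 + r_j2 ** 2 - 2.0 * r_j1 * r_j2 * r_12) / (1.0 - r_12 ** 2)
    return math.sqrt(r_squared)


print("tests 1 and 2:", round(multiple_r(0.48, 0.30, 0.45), 2))
print("tests 1 and 3:", round(multiple_r(0.48, 0.46, 0.00), 2))
print("tests 2 and 3:", round(multiple_r(0.30, 0.46, 0.00), 2))
```

When the two predictors are uncorrelated the formula reduces to the square root of the sum of squared validities, which is why adding test 3 to test 1 helps so much more than adding test 2.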

Optimal Weighting of Factors in Composites. We might well raise the ques- 
tion at this point as to whether tests 1 and 3, optimally weighted, with their 
multiple R of .66, have yielded the maximum amount of validity possible for 
a weighted composite that contains factors A, B, C, and D. Reference to 
equation (18.6) will show that the correlations $r_{j1}$ and $r_{j3}$ could have been
higher if the tests’ factor loadings $a_1$, $c_1$, $b_3$, and $d_3$ had been larger. The only
limits to those factor loadings would be that the communalities should not 
exceed 1.0. 

This, however, is not the whole story. We could make those loadings as 
large as the communalities would allow and they would still not yield the 
maximal correlation with criterion J unless they were in the right proportions. 
The right proportions would have to take into consideration the proportions 
of loadings $a_j$, $b_j$, $c_j$, and $d_j$ in the criterion. With sufficient loadings of the
four factors in the tests and with proper weightings, the maximum validity 
for the composite in predicting criterion J would be equal to the square root 
of the communality of that criterion. The square root of .66 is .81. This 
principle is reminiscent of the one mentioned in the last chapter regarding the 
index of reliability, which is the square root of the reliability coefficient. It 
gives the maximum possible correlation of anything with the variable in ques- 
tion. In this statement, however, is latent the assumption that all the true 
variance is common-factor variance; that is, $h^2 = r_{tt}$.

It is doubtful whether tests 1 and 3 could ever be weighted appropriately 
to yield a validity for their composite equal to the maximum .81 with criterion 
J, even though their common-factor loadings were as large as possible. The 
reason is that factors A and C are tied together in the same test and factors 
B and D are tied together in the other test. Since factors A and C have equal
loadings in criterion J and also in test 1, as long as they keep the same ratio
in test 1 they would be properly weighted in a regression equation. This is
merely a coincidence in this particular problem. Factors B and D, however,
are weighted in reverse order in test 3 and criterion J. For optimal prediction
of J, the loading $d_3$ should be greater than the loading $b_3$, to correspond with
the fact that the loading $d_j$ is greater than $b_j$. If we had loadings $b_3$ and $d_3$ in
proportion to the loadings $b_j$ and $d_j$ and also 50 per cent larger (just as $a_1$ and
$c_1$ are 50 per cent larger than $a_j$ and $c_j$), they would be .45 and .75, respectively.
These would yield [by equation (18.6)] an $r_{j3}$ equal to .51 (where it was .46)
and a multiple R of .70 (where it was .66).
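
The effect of reproportioning the loadings of test 3 can be verified numerically. The sketch below is illustrative only: the dictionaries and the helper names `validity` and `multiple_r_uncorrelated` are assumptions, and the multiple R is computed as the root of the summed squared validities, which holds here because tests 1 and 3 share no common factor and are therefore uncorrelated.

```python
import math

# Loadings of criterion J and of the tests on the factors they contain.
criterion_j = {"A": 0.40, "B": 0.30, "C": 0.40, "D": 0.50}
test1 = {"A": 0.60, "C": 0.60}
test3_old = {"B": 0.70, "D": 0.50}
test3_new = {"B": 0.45, "D": 0.75}  # proportional to b_j and d_j, 50 per cent larger


def validity(test, criterion=criterion_j):
    """Equation (18.6): sum of products of test and criterion loadings."""
    return sum(load * criterion.get(factor, 0.0) for factor, load in test.items())


def multiple_r_uncorrelated(*validities):
    """Multiple R for mutually uncorrelated predictors."""
    return math.sqrt(sum(v * v for v in validities))


# Upper limit: square root of the communality of the criterion.
ceiling = math.sqrt(sum(v * v for v in criterion_j.values()))

print("old r_j3:", round(validity(test3_old), 2))
print("new r_j3:", round(validity(test3_new), 2))
print("old R:   ", round(multiple_r_uncorrelated(validity(test1), validity(test3_old)), 2))
print("new R:   ", round(multiple_r_uncorrelated(validity(test1), validity(test3_new)), 2))
print("ceiling: ", round(ceiling, 2))
```

Even with the better-proportioned loadings the composite stays below the ceiling set by the criterion's communality, in line with the argument of the text.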

The moral of this is that, for the freedom to weight each factor in a com- 
posite as it should be weighted to get the maximal prediction of a criterion, 
it is best to use unique, or univocal, tests, i.e., each test with but one common
factor. In practice, a regression weight has to be applied to the test as a 
whole and all factors in it are weighted the same, in so far as external weights 
are applied. 

Increasing Validity by Adding Factors. We have just seen that increasing 
the practical validity of a composite depends upon large factor loadings for 
factors represented in the criterion and an optimal weighting of the individual 
factors. There is another important way of increasing the validity of a
composite, and that is to bring in a new test that covers a common factor in the
criterion that is not already covered. Criterion J was reported to have 14 
per cent of its variance devoted to specific sources. It is possible that this 
portion of the variance in J is really contributed by an unknown common fac- 
tor. Further experimental work might lead to an identification of it as 
stemming from one or more common factors. Suppose that it were found to 
belong entirely to one additional factor G. To contribute .14 to the total 
variance, the loading $g_j$ would be about .37. With an additional test to
measure this factor in the composite, the multiple R could be increased materially.

On the whole, there is much more to be gained in increasing R by discovery 
or identification of new factors than there is by increasing loadings for already 
known factors. With a large number of factors in a criterion, sizes of load- 
ings will have to be small in order to stay within the limit of its communality, 
and their multipliers (loadings in the tests) can be correspondingly small, so 
as to produce a maximum validity coefficient, within the limit of the square 
root of that communality. 


CONDITIONS UPON WHICH VALIDITY DEPENDS


Relation of Validity to Reliability. It has been a common belief that the 
practical validity of a test, other things being equal, is directly proportional 
to its reliability—the more reliable a test, the more valid it is. There is much 
in the application of factor theory to support this idea, as we can see by 
reference to previous paragraphs. The greater the error variance in a test, 
the less room there is for common-factor variance, and common-factor vari- 
ance is the source of validity. If we make a test more reliable and in so doing 
we increase variances in common factors, the possibilities for validity should 
be increased accordingly. 

When Validity and Reliability Are Independent. There are important 
exceptions to this relationship between validity and reliability. If a test is
heterogeneous, we might have a very low internal-consistency reliability and
yet a high practical validity. If a test is homogeneous, it would be possible to
increase its reliability without affecting its validity. The increased relia- 
bility might mean added variance in a common factor that has no relation to 
the criterion. For example, a test measuring visualization is known to have 
validity for the selection of pilots. We might increase the reliability of this 
test by making it more difficult, thereby adding reasoning variance. If 
reasoning variance has no correlation with the pilot criterion, no improvement 
in pilot validity would follow such a change in this test. The added common- 
factor variance in a test will increase the practical validity of a test only when 
that new type of variance is also present in the criterion. If there were no 
valid variance in a test to begin with, no amount of increased reliability will 
give it validity unless the added variance is related to the criterion. 

Goals of Validity and Reliability Sometimes Incompatible. When we seek to 
make a single test both highly reliable (internally) and also highly valid, we 
are often working at cross purposes. The two goals are incompatible in many 
respects. In aiming for one goal we may defeat our efforts toward the other. 

Maximal reliability requires high intercorrelation among items; maximal 
validity requires low intercorrelations. Maximal reliability requires items 
of equal difficulty; maximal validity requires items differing in difficulty. 
This point needs some explanation. Tucker has demonstrated this fact 
mathematically, but there is a simpler, common-sense rationale.¹ A range
of difficulty is necessary, of course, in order to obtain graded measures of
individuals. It was shown in Chap. 17 how with perfect intercorrelation of
items (which could occur with φ coefficients only when items are of equal
difficulty) there were only two scores—perfect scores and zeros. For spacing 
individuals in fine enough graduations for measurement purposes it is neces- 
sary to have a continuous distribution, not a U-shaped one. It would be 
ideal, for fine measurements, to space items, each discriminating well between 
all those above a certain point on the scale and those below, rather evenly all 
along the range of ability in the population. With such spacings, intercorre- 
lations could not be perfect, and some would, indeed, be very low. 

There must be some compromising of aims; both reliability and validity 
cannot be maximal. Fortunately, the kind of moderate item intercorrela- 
tions usually obtained for well-constructed items are of the size that, accord- 
ing to Tucker’s conclusions, will yield good validities. They will also yield 
satisfactory reliabilities, but those reliabilities will not often be above .90. 
To be more specific, the item-test correlations for well-constructed items 
range between .30 and .80, which means item intercorrelations approximately 
between .10 and .60. Items within these ranges of correlation should provide 
tests of both satisfactory reliability and validity. There is probably better 
reason for going below these limits than above them in constructing items. 

1 Tucker, L. R. Maximum validity of a test with equivalent items. Psychometrika, 
1946, 11, 1-13. 


To do so would probably err on the side of validity, which, after all, is the 
more important. 

Homogeneous Tests; Heterogeneous Batteries. The relation of heterogeneity 
to validity deserves more attention. One way to make a test more valid is to 
make it more heterogeneous. In factorial language this means adding new 
factors. If we succeeded in getting into the scores of the single test all the 
factors that are also in the practical criterion, and if we weighted them 
properly, we could achieve maximal accuracy of predictions from the single 
test. 

Recall, in this connection, the principles of the multiple-regression equa- 
tion. Maximal multiple correlation is achieved by minimizing the inter- 
correlations of the independent variables. If we apply this to test items, as 
separate variables, the principle still holds. The ideal test, from this point 
of view, would be one in which each item measured a different factor (and 
measured it consistently). This would mean a test of low internal reliability. 
It would also mean a test, which, though correlating well with the criterion, 
would make very crude discriminations for each factor. Each item would 
ordinarily differentiate only two categories—those who pass it and those who 
fail it—for each trait measured. If we brought in a number of items to 
measure each factor, with differences in difficulty to overcome this defect, we 
should have virtually a battery of tests within a single test. 

The solution to the incompatibility of goals of reliability and validity is 
precisely as just suggested: to use a battery of tests rather than single tests. 
Reliability should be the goal emphasized for each test; validity the goal 
emphasized for the battery. Even in the single test some reliability should 
be sacrificed for the sake of well-graded measurements. It is strongly urged 
that, if possible, each test be designed to measure one common factor. It 
should be univocal, its contribution unique. In this way minimal intercorre- 
lation of tests is assured, which satisfies one of the major principles in multi- 
ple regression. It was also shown that when tests are univocal the various 
factors can be weighted in the best way to make each prediction. The 
univocal test will correlate less with a practical criterion than will a hetero- 
geneous test, but what we lose in validity for the single test will be more than 
made up by forming batteries which cover the factors to be predicted and in 
a more manageable manner. For the sake of meaningful profiles also, a
battery of univocal tests has no equal. 

Reliabilities and Test Batteries. If a composite score from a battery is to
be used and not part scores from the components, as in a profile, it is likely
that there is not much to be gained by achieving reliabilities for single tests
higher than .60 or by having tests longer than 30 items each.* The reliability
of the composite score of independent tests will be approximately a weighted
average of the reliabilities of the components.¹ This means that if the
components have a generally low reliability, in such a battery the reliability of
the composite will be low. This need not be disturbing, provided the
validity of the composite is high. To the extent that the components are
intercorrelated, the reliability of the composite will exceed the average
reliability of the components. In general, if there is a choice between lengthening
of tests in a battery to make them more reliable and adding more tests of
different kinds that contribute unique valid variances, the decision should
certainly go to the second alternative. If part scores are to be used
separately, however, attention must also be given to reliability of components.

* Dailey, J. T. Determination of optimal test reliability in a battery of aptitude tests.
Technical memorandum No. 10, Lackland Air Force Base, 1948.

FIG. 18.3. Proportion passing an item (responding correctly) as a function of ability level.

Discrimination Values of Items. Some of the points just discussed may be
made a little clearer if we approach the item theory from a still different 
aspect. Figure 18.3 is used to illustrate this approach. Imagine a scale of 
ability or of any other trait that we attempt to measure by means of a test. 
We want each item to correlate with that variable, to predict the status of 
individuals with respect to the variable, to discriminate between individuals. 

Suppose we already know the positions of large numbers of individuals on 
this scale. We apply to them an item that we will call item C. The item is 
of median difficulty, for of the entire group 50 per cent respond in the
acceptable manner and 50 per cent do not. According to the requirements of good


1 Mosier, C. I. On the reliability of a weighted composite. Psychometrika, 1943, 8,
161-168.


reliability, this knowledge about the difficulty of item C is promising, but not 
sufficient evidence that the item would contribute to a reliable test. We do 
not yet know whether it is at all related to the variable we want to measure. 
It could be of median difficulty and still be uncorrelated with other items in 
the test. Let us subdivide the large sample into subsamples grouped in class
intervals as if for known values along the scale. We are now interested in 
seeing whether those groups higher on the scale have any greater probability 
of passing the item than those lower on the scale. Theory states, and experi- 
mental evidence supports the idea, that the increase in the probability of 
passing the item follows the normal cumulative frequency curve. The 
regression of proportions passing the item upon ability is the S-shaped or 
ogive form. For item C, not very far below average ability we find a point 
below which none pass the item. Above a point just as far above the mean 
we find that all pass the item. The interval between is sometimes called the 
transition zone, a concept borrowed from psychophysics.¹

Other items may have the same difficulty level as item C, but like items B 
and A in the diagram (Fig. 18.3) they have different degrees of discriminating 
power. Both B and A have much wider transition zones (they both actually
go beyond the range of the given horizontal scale) and their curves have 
slopes that are less steep than that for C. The steepness of the slope is known 
as the curve’s precision. The term applies well here because the steeper the 
precision of the curve, the greater is the precision of discrimination. A per- 
fectly discriminating item is D, whose slope is infinite. A nondiscriminating 
item is F, whose slope is zero. There is a mathematical relationship between 
the precision of an ogive like these and the correlation between the item and a 
good measure of the trait.² Item E would have a negative correlation with
the variable to be measured. This would be an unusual event and would 
probably mean that the item was keyed wrong in scoring. Items like D 
would seem to be ideal; they are perfectly discriminating. But it can be 
seen how only one such item used alone would be almost futile, for it dis- 
criminates at only one point. 
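
The family of item curves described above can be sketched with the cumulative normal function. The code below is a hedged illustration, not a reproduction of Fig. 18.3: the function name `proportion_passing`, the scaling of `precision` as a simple multiplier of the distance from the difficulty point, and the particular precision values assigned to items C, B, A, F, and E are all assumptions chosen for display. Item D, with infinite slope, would be a step function and is not represented.

```python
import math


def proportion_passing(ability, difficulty=0.0, precision=1.0):
    """Normal-ogive item curve.

    Returns the cumulative normal probability of precision * (ability - difficulty).
    A larger precision gives a steeper curve and a narrower transition zone.
    """
    z = precision * (ability - difficulty)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Hypothetical items, all of median difficulty (difficulty = 0 on a z scale).
items = {"C (steep)": 3.0, "B": 1.0, "A": 0.5, "F (flat)": 0.0, "E (reversed)": -1.0}
ability_levels = (-2, -1, 0, 1, 2)

for name, precision in items.items():
    row = [round(proportion_passing(x, precision=precision), 2) for x in ability_levels]
    print(f"{name:13s}", row)
```

Every curve passes through .50 at the difficulty point; only the steepness differs, which is the sense in which items of equal difficulty can differ in discriminating power.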

The second diagram is more realistic and yet pictures a somewhat ideal 
situation. It shows a series of items about equally spaced as to difficulty and 
all with excellent discriminating power. With the extensive range of diffi- 
culty level, there could not be as high internal reliability as some might desire. 
But the possibility of accurately grading individuals on a continuous scale 
is greater because of that dispersion. To appreciate the full value of the 
items that depart from medium difficulty, one would need either to use a 
biserial r or a tetrachoric r in correlating item with total score or to make 
allowance for the effect of divergencies in difficulty upon the phi coefficient. 


1 Woodworth, R. S. Experimental Psychology. New York: Holt, 1938. P. 401.
2 For proof of this, see Richardson, M. W. Relation between the difficulty and the
differential validity of a test. Psychometrika, 1936, 1, 33-49.

Validity and the Length of Test. Since the homogeneous lengthening of a
homogeneous test increases its reliability, in accordance with the Spearman- 
Brown formula, it will also increase its validity. If the change in length is by 
some ratio n (the new length divided by the old) the new validity of the test 
is estimated by the formula

$r_{y(nx)} = \dfrac{r_{yx}}{\sqrt{\dfrac{1 - r_{xx}}{n} + r_{xx}}}$        (18.7)

(Validity of a test increased in length n times)

where $r_{yx}$ = validity coefficient for predicting criterion Y from test X and
$r_{xx}$ = reliability of test X.

A certain line-drawing test developed to predict creative abilities of stu- 
dents in a course in designing had a reliability of .57 and a correlation with 
teacher's ratings of .65.¹ If this test were made twice as long, what validity
could be expected? Applying formula (18.7),

$r_{y(2x)} = \dfrac{.65}{\sqrt{\dfrac{1 - .57}{2} + .57}} = .73$


It would thus definitely pay to make this test longer and more reliable in 
order to improve its validity. 
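
The arithmetic of formula (18.7) can be checked with a few lines of Python. This is only a minimal sketch: the function name `lengthened_validity` is an illustrative choice, not something from the text, and the inputs are the line-drawing figures quoted above.

```python
import math


def lengthened_validity(r_yx: float, r_xx: float, n: float) -> float:
    """Estimated validity after a test is lengthened n times, formula (18.7)."""
    if n <= 0:
        raise ValueError("n must be a positive length ratio")
    return r_yx / math.sqrt((1.0 - r_xx) / n + r_xx)


# Line-drawing test: validity .65, reliability .57, length doubled.
doubled = lengthened_validity(0.65, 0.57, 2)
print(round(doubled, 2))  # 0.73

# As n grows without limit the estimate approaches r_yx / sqrt(r_xx).
ceiling = 0.65 / math.sqrt(0.57)
print(round(ceiling, 2))  # 0.86
```

The second figure is the limit of formula (18.7) for very large n; it suggests that no amount of homogeneous lengthening would carry this test's validity much beyond .86.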

If we wanted to know how much homogeneous lengthening is needed in
order to achieve a desired level of validity, we could do this by solving
formula (18.7) for n, which gives


$$n = \frac{1 - r_{xx}}{\dfrac{r_{yx}^2}{r_{y(nx)}^2} - r_{xx}}$$
(Ratio of new length of test for a required validity) (18.8)


where the symbols are as defined for formula (18.7).
If we wanted a validity of .80 for the line-drawing test, the revised length
would have to be


$$n = \frac{1 - .57}{\dfrac{.4225}{.6400} - .57} = 4.8$$


Whether it would be practical to devote nearly five times as much effort to 
this test is a question of policy that goes beyond statistical answers. 
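
Formula (18.8) can be sketched the same way. The function name `length_ratio_for_validity` and the error raised for unattainable targets are assumptions added for illustration; the text itself gives only the formula.

```python
def length_ratio_for_validity(r_yx: float, r_xx: float, target: float) -> float:
    """Length ratio n needed to reach a target validity, formula (18.8)."""
    denominator = (r_yx / target) ** 2 - r_xx
    if denominator <= 0:
        # The target exceeds what lengthening alone can reach (r_yx / sqrt(r_xx)).
        raise ValueError("target validity cannot be reached by lengthening")
    return (1.0 - r_xx) / denominator


# Line-drawing test: validity .65, reliability .57, desired validity .80.
n_needed = length_ratio_for_validity(0.65, 0.57, 0.80)
print(round(n_needed, 1))  # 4.8
```

A non-positive denominator means the desired validity lies above the limit that formula (18.7) approaches as n increases, so the sketch reports that case instead of returning a meaningless negative ratio.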

Relation of Validity Coefficients to Errors of Measurement. When two
fallible measures are correlated, the errors of measurement, if uncorrelated
among themselves, always serve to lower the coefficient of correlation as
compared with what it would have been had the two measures been perfectly
reliable. We say that the degree of correlation has been attenuated. If we
want to know what the correlation would have been if the two variables were
perfectly measured, we must resort to the correction for attenuation, for which
we have a formula


$$r_{\infty\omega} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}$$
(An intercorrelation corrected for attenuation) (18.9)


where $r_{xy}$ = obtained correlation between the two tests and $r_{xx}$ and
$r_{yy}$ = reliability coefficients of the two tests.


1 Guilford, J. P., and Guilford, R. B. A prognostic test for students in design. J. appl.
Psychol., 1931, 15, 335-345.

The correlation obtained between a figure-classification test and a form- 
perception test was .36. The reliability coefficients for the two tests were 
.60 and .94, respectively. Applying formula (18.9), 

$$r_{\infty\omega} = \frac{.36}{\sqrt{(.60)(.94)}} = .48$$


We should therefore expect the correlation between true scores in these two 
tests to be .48 rather than the obtained one of .36. 

In general, when making this correction for attenuation in both fallible
tests, if each test exists in two forms (as it will when alternate forms were
used for purposes of finding reliability), there is a possibility of determining
four intercorrelations between the two tests, i.e., each form of the one
correlated with each of the two forms of the other. In this case, it is well to
use all the information we can get concerning the intercorrelation of the two
tests by computing the four coefficients and using their arithmetic mean as a
better estimate of the numerator of the fraction in formula (18.9).
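
The correction of formula (18.9), together with the averaging of cross-form coefficients, might be sketched as follows. The function name and the four cross-form correlations are hypothetical values chosen only to illustrate the averaging step; only the .36, .60, and .94 come from the example above.

```python
import math
from statistics import mean


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correlation between true scores estimated by formula (18.9)."""
    return r_xy / math.sqrt(r_xx * r_yy)


# Figure-classification and form-perception tests from the text.
print(round(correct_for_attenuation(0.36, 0.60, 0.94), 2))  # 0.48

# Hypothetical case: each test has two forms, giving four intercorrelations.
cross_form_rs = [0.34, 0.38, 0.35, 0.37]  # illustrative values only
averaged_numerator = mean(cross_form_rs)
corrected = correct_for_attenuation(averaged_numerator, 0.60, 0.94)
```

Averaging before correcting uses all four coefficients as the numerator, which is the procedure the paragraph above recommends.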

Factorial Explanation of Attenuation and Its Correction. It may not be 
clear to the reader why errors of measurement always lower intercorrelations, 
and why, when the corrective formula is applied, correlations should not be 
perfect. The answers to both of these questions can best be given by refer- 
ence to factor theory. 

Consider test 1 and criterion J of the illustration used above when factor
theory was introduced. Error variance made up 18 per cent of the total
variance of test 1 and 20 per cent of criterion J. Let us suppose that we could
rid each variable of all errors of measurement, all error variance. In doing
so, let us further suppose that the remaining true variance is expanded with
all its components in proportion to their original amounts. Figure 18.4
demonstrates what happens when the error components are “squeezed out”
of variables and the true-variance components expand to take their places.
Variances that were .36 and .36 in factors A and C in test 1 before correction
become .439 and .439 after correction. The new factor loadings are .663 in
each factor. In the criterion the corresponding loadings become .447 in place
of .40. By equation (18.6), the new correlation $r_{1j}$ becomes .59, whereas it
was .48. The use of formula (18.9) applied to the original $r_{1j}$ gives


$$r_{\infty\omega} = \frac{.48}{\sqrt{(.82)(.80)}} = .59$$


The change in validity from .48 to .59 is shown graphically in Fig. 18.4. 
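
The equivalence of the two routes (expanding the factor loadings, or applying formula (18.9) directly) can be verified numerically. The sketch below assumes, as the figures in the text imply, that test 1 and criterion J share only factors A and C, with loadings of .60 and .40 respectively; the function names are illustrative.

```python
import math


def disattenuate_loadings(loadings: dict, reliability: float) -> dict:
    """Rescale loadings so that true variance expands to fill the total variance."""
    return {factor: a / math.sqrt(reliability) for factor, a in loadings.items()}


def correlation_from_loadings(a: dict, b: dict) -> float:
    """Sum of cross products of loadings in the factors the two variables share."""
    return sum(a[f] * b[f] for f in a.keys() & b.keys())


test_1 = {"A": 0.60, "C": 0.60}        # variances .36 and .36
criterion_j = {"A": 0.40, "C": 0.40}   # variances .16 and .16
rel_1, rel_j = 1 - 0.18, 1 - 0.20      # reliabilities .82 and .80

before = correlation_from_loadings(test_1, criterion_j)
after = correlation_from_loadings(
    disattenuate_loadings(test_1, rel_1),
    disattenuate_loadings(criterion_j, rel_j),
)
by_formula = before / math.sqrt(rel_1 * rel_j)
print(round(before, 2), round(after, 2), round(by_formula, 2))  # 0.48 0.59 0.59
```

Dividing every loading by the square root of the reliability is the same operation as dividing the correlation by the square root of the product of the two reliabilities, which is why the two results agree.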

Fig. 18.4. Proportions of variance in a test and a criterion after correction for attenuation
(elimination of error variance statistically), also the contribution of factors to the validity
coefficient before and after correction.

Correction for Attenuation in the Criterion Only. The preceding device
has limited application except in theoretical problems. In practice, we are
compelled to deal with fallible tests. If the tests from which we wish to
predict something else are not perfect, that fact must be faced, and our
predictions are reduced in accuracy accordingly. But we should hardly expect to
be asked to overlook the fallibility of the criterion we are trying to predict.
If it measures success inaccurately, this lack of accuracy should not be
permitted to make it appear that the test is less valid than it really is. It is
customary, therefore, to correct practical-validity coefficients for attenuation
in the criterion measurements but not in the test scores. This one-sided
correction is made by the formula


$$r_{\infty x} = \frac{r_{yx}}{\sqrt{r_{yy}}}$$
(Validity coefficient corrected for attenuation in the criterion only) (18.10)



As an application of this formula, we cite the line-drawing test previously
mentioned that correlated with a teacher's rank-order judgments of creative
ability in her students in design to the extent of .65. The reliability of the
teacher's ratings (combined from two rank orders a month apart) was found
to be .82. Had the teacher's ratings been perfectly reliable measures of the
thing she was judging, the correlation with test scores would have been
$.65/\sqrt{.82} = .72$. The correlation of .72 is accordingly taken as the genuine
validity of the test, unless we are concerned about predicting teacher's
judgments, contaminated by flaws as they obviously are, rather than genuine
ability as evidenced by those ratings.
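
A short sketch of the one-sided correction follows, with the two-sided correction of formula (18.9) computed alongside for contrast. The function name is an assumption for illustration; the inputs (.65, .82, and the test reliability of .57 quoted earlier) are the text's own figures.

```python
import math


def correct_for_criterion_attenuation(r_yx: float, r_yy: float) -> float:
    """Validity corrected for unreliability of the criterion only, formula (18.10)."""
    return r_yx / math.sqrt(r_yy)


one_sided = correct_for_criterion_attenuation(0.65, 0.82)
print(round(one_sided, 2))  # 0.72

# For contrast: correcting for the test's unreliability (.57) as well.
two_sided = 0.65 / math.sqrt(0.57 * 0.82)
print(round(two_sided, 2))  # 0.95
```

The second value describes a perfectly reliable test that does not exist, which is why the one-sided figure is the one appropriate to practical prediction.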

Many a validity coefficient reported in the literature is of very uncertain 
meaning because errors of measurement in the criterion were not taken into 
account. The reliability of ratings, even of the better ones, is character- 
istically about .60. With such criteria, validity coefficients are about 25 per 
cent underestimated. Too often the reliability problem of a criterion is 
entirely ignored. The writer has known of purported criteria of a perform- 
ance kind (bombing errors of bombardiers in training) which at best had 
reliabilities of only approximately .30. What is even more important, but 
incidental to the discussion here, is the validity of the criterion. Any investi- 
gator who hopes to develop successful selective instruments is often beaten 
before he starts, if he does not first ensure reliable and valid criteria, or if he 
does not estimate these features and make allowances for them. 

Limitations to the Use of Correction for Attenuation. The correction of a
correlation for attenuation requires that we have a rather accurate estimate
of reliability for each variable that enters into the situation. If either $r_{xx}$ or
$r_{yy}$ is underestimated, the corrected $r_{xy}$ will be overestimated. If either
reliability index is overestimated, the corrected $r_{xy}$ will be underestimated.
It is probably best, if one wishes to be on the conservative side, that, if
anything, a reliability estimate should be too large when used for this
purpose. On the other hand, it is likely that most estimates of internal-
consistency reliability are too low, which is in the wrong direction for
conservatism.
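
The direction of these errors is easy to see numerically. The sketch below reuses the figure-classification example (obtained correlation .36, second reliability .94) and varies the assumed reliability of the first test around its reported value of .60; the alternative values .50 and .70 are hypothetical misestimates chosen for illustration.

```python
import math


def correct_for_attenuation(r_xy: float, r_xx: float, r_yy: float) -> float:
    """Correlation corrected for attenuation in both variables, formula (18.9)."""
    return r_xy / math.sqrt(r_xx * r_yy)


r_xy, r_yy = 0.36, 0.94
results = {}
for assumed_r_xx in (0.50, 0.60, 0.70):  # .60 is the reported reliability
    results[assumed_r_xx] = correct_for_attenuation(r_xy, assumed_r_xx, r_yy)
    print(f"assumed r_xx = {assumed_r_xx:.2f} -> corrected r = {results[assumed_r_xx]:.3f}")
```

An underestimated reliability inflates the corrected coefficient, and an overestimated one shrinks it, which is the asymmetry behind the advice to err on the side of a reliability estimate that is too large.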

There is also the question as to which of the three main types of reliability
coefficient is desirable in correcting for attenuation. There are proponents
for the use of each type in this connection. It is best to decide what kind of
errors of measurement should be ruled out in the particular situation or
particular use of the corrected correlation. Once this decision is made, the
type of reliability will be selected accordingly, since it was shown in the
preceding chapter that each type emphasizes certain sources of variance as
error. The tendency of internal-consistency methods to underestimate
reliability is against their use where there is a reasonably good alternative.

The Index of Forecasting Efficiency with a True Criterion. An index of
forecasting efficiency (see Chap. 15) could be computed directly from $r_{\infty x}$ to
denote the improvement in predicting the true criterion variable on the basis
of knowledge of test scores over prediction without that knowledge. This
statistic can be calculated directly from the known r's, however, without first
finding $r_{\infty x}$, by use of the formula¹


$$E_{\infty x} = 100\left(1 - \sqrt{1 - \frac{r_{yx}^2}{r_{yy}}}\right)$$
(Index of forecasting efficiency for a true criterion) (18.11)


Standard Error of the Estimate of a True Criterion. Taking the correlation
between our fallible scores and an infallible or true criterion as the
coefficient of validity, we shall also have smaller errors of prediction than if we
tried to predict fallible criterion measurements. We could substitute $r_{\infty x}$ in
the usual formula for finding the standard error of the estimate from r, but
the $\sigma_{yx}$ (which now becomes $\sigma_{\infty x}$) can be calculated directly from the original
correlations by the formula


$$\sigma_{\infty x} = \sigma_y \sqrt{r_{yy} - r_{yx}^2}$$
(Standard error of estimate of a true criterion) (18.12)
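
Both statistics can be computed directly from the obtained validity and the criterion reliability. In this sketch the validity (.65) and criterion reliability (.82) are the line-drawing figures used earlier, while the criterion standard deviation of 10.0 is a hypothetical value, since the text reports none; the function names are likewise illustrative.

```python
import math


def forecasting_efficiency_true_criterion(r_yx: float, r_yy: float) -> float:
    """Index of forecasting efficiency for a true criterion, in per cent."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r_yx ** 2 / r_yy))


def se_estimate_true_criterion(sigma_y: float, r_yx: float, r_yy: float) -> float:
    """Standard error of estimate when predicting the true criterion."""
    return sigma_y * math.sqrt(r_yy - r_yx ** 2)


efficiency = forecasting_efficiency_true_criterion(0.65, 0.82)
std_error = se_estimate_true_criterion(10.0, 0.65, 0.82)  # sigma_y = 10.0 is assumed
print(round(efficiency, 1), round(std_error, 2))  # 30.4 6.3

# Same index reached the long way, through the corrected validity r = .65 / sqrt(.82).
r_true = 0.65 / math.sqrt(0.82)
long_way = 100.0 * (1.0 - math.sqrt(1.0 - r_true ** 2))
```

The last two lines confirm that the shortcut gives the same index as first correcting the validity coefficient and then applying the ordinary formula for forecasting efficiency.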


Validity of Right and Wrong Responses. Many tests are scored with a
formula score in which the wrong responses are given a negative fractional
weight and the right responses a weight of +1.

A Priori Scoring Formulas. One of the reasons back of such scoring
formulas is the a priori reasoning about chance success and the need for correcting
for it. In a true-false test we have a two-alternative situation and the
assumption is that when the examinee does not know an answer he will guess at
random. When he guesses, his probability of getting the right answer is .5.
When there are three alternatives, the theoretical proportion of right answers
in guessing is .33; in a four-choice item the probability is .25, and so on. This
has led to the stock scoring formula of the form


$$S = R - \frac{W}{k - 1}$$
(A test score with a priori correction for guessing) (18.13)


where R = number of right responses
      W = number of wrong responses
      k = number of alternative responses to each item
In a true-false test this reduces to the familiar R − W. In a five-choice-
item test it becomes R − W/4. Incidentally, a similar correction could be
made by the general formula


$$S = R + \frac{O}{k}$$
(Alternative scoring formula with correction for guessing) (18.14)


where O = number of omissions (including items not attempted).
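
The two a priori formulas can be compared on a single answer sheet. The counts below (a 50-item, five-choice test with 30 right, 12 wrong, and 8 omitted) are hypothetical, and the function names are illustrative only.

```python
def score_with_penalty(rights: int, wrongs: int, k: int) -> float:
    """Formula (18.13): deduct 1/(k - 1) of a point for each wrong answer."""
    return rights - wrongs / (k - 1)


def score_with_omission_bonus(rights: int, omissions: int, k: int) -> float:
    """Formula (18.14): credit 1/k of a point for each omitted item."""
    return rights + omissions / k


rights, wrongs, omissions, k = 30, 12, 8, 5   # hypothetical examinee
n_items = rights + wrongs + omissions

penalty_score = score_with_penalty(rights, wrongs, k)
bonus_score = score_with_omission_bonus(rights, omissions, k)
print(penalty_score, bonus_score)  # 27.0 31.6

# For a test of fixed length the two scores differ only by a linear transformation.
rescaled = n_items / k + (k - 1) / k * penalty_score
```

Because the second score equals N/k plus (k − 1)/k times the first, where N is the number of items, the two formulas rank examinees identically on the same test.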


1 Conrad, H. B., and Martin, G. B. The index of forecasting efficiency for the case of a
“true” criterion. J. exp. Educ., 1936, 4, 231-244.

It should be emphasized that neither of these formulas will tend to reduce
the error variance introduced by guessing unless there are an appreciable
number of omissions or failures to attempt items. If every examinee
attempts all items, the correlation between R and W will be a perfect −1.0,
which offers no freedom for improvement by scoring formula. The formula 
scores would then correlate +1 with R and the correction operation would be 
of no value. In a speed test, however, and in a power test in which the 
examinees voluntarily omit many items, such a scoring formula may help to 
eliminate some of the error variance and thus promote better reliability and 
validity. The more difficult the test, the more important it is to apply the 
correction formula, for as difficulty increases the amount of guessing increases. 
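
A small numerical illustration of this point follows. The rights and wrongs counts are hypothetical data for five examinees on a 40-item, four-choice test, and the Pearson function is written out by hand so that the sketch needs nothing beyond the standard library.

```python
import math


def pearson(x: list, y: list) -> float:
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def formula_score(rights: int, wrongs: int, k: int) -> float:
    """A priori formula score, formula (18.13)."""
    return rights - wrongs / (k - 1)


N_ITEMS, K = 40, 4
rights = [10, 18, 25, 31, 36]  # hypothetical examinees

# Case 1: every item attempted, so W = N - R for each examinee.
wrongs_all = [N_ITEMS - r for r in rights]
scores_all = [formula_score(r, w, K) for r, w in zip(rights, wrongs_all)]
r_rw_all = pearson(rights, wrongs_all)
r_rs_all = pearson(rights, scores_all)

# Case 2: examinees omit different numbers of items.
wrongs_omit = [20, 5, 15, 2, 4]
scores_omit = [formula_score(r, w, K) for r, w in zip(rights, wrongs_omit)]
r_rs_omit = pearson(rights, scores_omit)
```

In the first case the formula score is a linear function of R alone, so it cannot rank examinees any differently from the rights score; only in the second case does the wrongs score carry information of its own.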

If a scoring formula of this type is to be used in a test, and particularly if it 
is a power test, there should be explicit instructions to the examinees that 
there will be a deduction of a fraction of a point for each wrong answer (or a 
bonus of a fraction of a point for an omission). The second formula is 
naturally more palatable to examinees. But there are usually better scoring 
formulas than those based upon the a priori reasoning about guessing, as we 
shall see next. 

It might be pointed out, incidentally, that when examinees are ignorant 
of the answer to an item, their habits of taking tests are such that they do not 
choose among the alternatives entirely at random. Certain positions in a 
list of five responses may be favored by habits of reading or of attention. 
This is probably not sufficiently important in itself to overthrow the useful- 
ness of “chance” scoring formulas. In the long run, if the position of the 
right answer is randomized, the correction may work well enough. More 
serious, however, is the fact that many test writers, in preparing four- or five- 
choice items, do not provide “misleads” or “distractors” that are equally 
attractive. It is easy, perhaps, to think of one good wrong answer to an 
item, but to think of more than one and to make all equally attractive is a 
trying art. Many a four- or five-choice item reduces virtually to a three- or 
two-choice item because of this fact. The a priori scoring formula as given 
above then undercorrects. We do not know by how much. 

Empirical Weighting of Right and Wrong Answers. When R and W scores 
are not too highly intercorrelated, and when there is a practical criterion, it 
often pays to treat the two as if they were two different variables, as if they 
had arisen from two different tests. One then applies the multiple-regression 
procedures and derives optimal weights which will maximize the correlation 
of a weighted combination of R and W scores and the criterion. Since, as 
pointed out before, it is the relative sizes of the weights that are important 
and we do not care whether the formula scores have the same mean as the 
criterion or represent predictions in proper sizes, we can let the R score have 
a weight of +1 and find what weight the W score must then have. We should 
expect it to have a fractional negative weight, though it might differ markedly
from the weight given by formula (18.13). For this purpose, Thurstone¹ has
given the following equation to determine the weight for the W score:


$$v = \frac{\sigma_r (r_{cr} r_{rw} - r_{cw})}{\sigma_w (r_{cw} r_{rw} - r_{cr})}$$
(Optimal weight for error scores when weight for rights scores is +1) (18.15)


where the subscripts c, r, and w stand for criterion, rights, and wrongs scores,
respectively. The correlation between these formula scores and the criterion
is given by the usual multiple-R formula for three variables. In symbols
that apply here,


$$R^2_{c.rw} = \frac{r_{cr}^2 + r_{cw}^2 - 2 r_{cr} r_{cw} r_{rw}}{1 - r_{rw}^2}$$
(Correlation of optimally weighted formula score with a criterion) (18.16)


where the subscripts are as defined above. Note that this gives R².
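
Formulas (18.15) and (18.16) might be sketched as below. All numerical inputs are hypothetical values chosen for illustration, and the function names are not from the text. As a cross-check, the sketch also correlates the weighted composite R + vW with the criterion by the standard correlation-of-sums expression; the two results should agree.

```python
import math


def thurstone_wrongs_weight(sigma_r, sigma_w, r_cr, r_cw, r_rw):
    """Optimal weight for W when R is weighted +1, formula (18.15)."""
    return sigma_r * (r_cr * r_rw - r_cw) / (sigma_w * (r_cw * r_rw - r_cr))


def squared_multiple_r(r_cr, r_cw, r_rw):
    """Squared multiple correlation of C with R and W, formula (18.16)."""
    return (r_cr ** 2 + r_cw ** 2 - 2 * r_cr * r_cw * r_rw) / (1 - r_rw ** 2)


# Hypothetical statistics for one test.
sigma_r, sigma_w = 8.0, 5.0
r_cr, r_cw, r_rw = 0.40, -0.30, -0.50

v = thurstone_wrongs_weight(sigma_r, sigma_w, r_cr, r_cw, r_rw)
multiple_r = math.sqrt(squared_multiple_r(r_cr, r_cw, r_rw))
print(round(v, 2), round(multiple_r, 3))  # -0.64 0.416

# Cross-check: correlation of the composite R + v*W with the criterion.
covariance = sigma_r * r_cr + v * sigma_w * r_cw
variance = sigma_r ** 2 + (v * sigma_w) ** 2 + 2 * v * sigma_r * sigma_w * r_rw
composite_r = covariance / math.sqrt(variance)
```

With these inputs the optimal penalty for a wrong answer is about two-thirds of a point, and the composite reaches a validity of about .416 as against .40 for the rights score alone.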

The application of these formulas sometimes leads to surprising results. A 
two-choice numerical-operations test, a fairly simple and unique measure of 
the factor known as facility with numbers, should have had a scoring formula 
of R — 3W to yield maximal validity for the selection of navigators in the 
AAF. Another, five-choice, numerical-operations test should have had a 
weight of —2 for wrong answers. Thus the importance of accuracy was 
much greater than the a priori weights would have provided for. For the 
selection of bombardier students, the weight for wrong responses should have 
been about —.5 for the two-choice items and about zero for the five-choice 
items, for maximal validity of the test. For the bombardier criterion, 
accuracy was of relatively less importance than for the navigator. 

For still other tests, there were results deviating from a priori weighting.
For example, one test involving estimations of lengths or distances on a map
seemed to require a positive weight for wrong answers, for maximal validity
for pilots, indicating that speed was of great importance in this test, even at
the expense of accuracy.

On the whole, the experience with scoring formulas tended to show that 
empirical formulas give validities slightly better than a priori weighting of 
wrong responses, with gains of the order of .02 to .03 being typical. On the 
whole, optimal weighting of wrongs gives increases of the order of .03 to .06 
over validities for the rights scores used alone. There are some instances 
when the optimal weight for W is zero. 

In Fig. 18.5 are shown the relationships between validities of formula
scores in three different tests and different weights for wrongs scores in those
tests when the rights scores are weighted +1. Not only can we see that there
is an optimal weight for the wrongs scores for each test (.0 for test 1, −1 for
test 2, and approximately −3 for test 3) but also that some weights would be
detrimental to validity. These various validities can be estimated by using
the correlation-of-sums formulas given in Chap. 16. The validity of each
test when scored for number of right responses only can be noted at the place
where v = 0. The amount of gain by optimal weighting can be noted by
comparing this validity with the peak of the curve. There is no very marked
change in validity for various negative weights up to −.5. An error in
weighting in the negative direction would apparently not be very serious.
But validity drops much more rapidly, sometimes precipitously, if the error in
the weight is in the other direction and the weight goes on the positive side.
Common-sense reasoning would ordinarily not permit us to choose a positive
weight for the wrongs.

1 Thurstone, L. L. The Reliability and Validity of Tests. Ann Arbor, Mich.: Edwards,
1931. P. 80.

Empirical scoring formulas should not be derived unless samples are quite 
large. In some combinations of correlations among C, W, and R, the weight 
is very sensitive to minor errors in any one of the three correlations involved 
and may be unreasonable on the face of it. When in doubt, it is best to be 
conservative. It may help to plot a curve for a test, after assuming different
weights for W and solving for the correlation of each weighted score with the
criterion by the formula for correlation of sums (16.25).
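
Such a curve can be tabulated in a few lines. The sketch assumes that the standard expression for the correlation of a weighted sum with a third variable is what formula (16.25) supplies; that formula is not reproduced in this chapter, so the correspondence is an assumption. The test statistics are hypothetical.

```python
import math


def composite_validity(v, sigma_r, sigma_w, r_cr, r_cw, r_rw):
    """Correlation of the composite R + v*W with criterion C."""
    covariance = sigma_r * r_cr + v * sigma_w * r_cw
    variance = sigma_r ** 2 + (v * sigma_w) ** 2 + 2 * v * sigma_r * sigma_w * r_rw
    return covariance / math.sqrt(variance)


# Hypothetical statistics for one test.
params = dict(sigma_r=8.0, sigma_w=5.0, r_cr=0.40, r_cw=-0.30, r_rw=-0.50)

weights = [w / 10 for w in range(-30, 11)]   # -3.0 through +1.0 in steps of .1
curve = [(v, composite_validity(v, **params)) for v in weights]
best_v, best_r = max(curve, key=lambda point: point[1])

for v in (-3.0, -2.0, -1.0, best_v, 0.0, 1.0):
    print(f"v = {v:+.1f}  validity = {composite_validity(v, **params):.3f}")
```

For these inputs the tabulated curve is nearly flat around its peak on the negative side and falls off sharply once the weight turns positive, the same general shape described for Fig. 18.5.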

Factorial Validity of Rights and Wrongs Scores. The procedure for
maximizing practical validity for a test by using the proper scoring weights can
also be applied to maximizing the correlation of a test with a factor; in other
words, to increasing its loading in a factor. Recent experiences show that
error scores might well be given much attention as sources of certain kinds of
variance that it is worth our while to measure. Some AAF findings indicated
that a trait of carefulness was quite measurable by using wrongs scores in
several tests, whereas the number of right responses usually failed to measure
it.¹

Fruchter has more recently found by factor-analyzing rights scores and
wrongs scores in the same tests that while the two scores in the same test may
measure the same factors (in reverse), they do so to different degrees. He
also found that some factors are more measurable by wrongs score than
others.² In fact, it is possible that a certain kind of reasoning should be
measured by errors rather than by correct solutions. These results have not
been verified as yet, but they are suggestive of the rich possibilities there may
be in fuller use and weighting of wrong responses.

Validity of Items and of Their Composite. There have been proposals that 
each test be regarded as a battery and that its items be weighted according to 
the multiple-regression equation. The method is, of course, impractical in 
tests of any useful length. The result would also run counter to the goal of 
maximal reliability and uniqueness for each test. 

There are many tests of interest and of temperament, however, in which 
differential weighting of items and of responses to items is common practice. 
This is because some items are very much more diagnostic of the criterion 
than others when they are taken alone. It is desired to give the better items 
full representative voice in the multiple prediction. A number of weighting 
procedures have been used, all of which involve some index of validity of the 
element (item or response). They make this much of an approach to apply- 
ing the multiple-regression principles. 

The Importance of Weighting Item Responses. There are instances in which 
weighted scoring has materially improved reliability over that attainable 
with unweighted scoring. By “unweighted scoring" we mean here that each 
response is given a value of 0 or 1 only. Studies of validity have generally 
not shown much benefit from differential weighting of items. Any benefits 
from weighting are likely to be secured in short tests (20 items or less) only. 
Every test constructor, in these days of machine scoring, in which differential 
weighting is bothersome, should be challenged to show good cause for 
other than the simplest system of weighting. 

Selection of Items by Correlating with an Outside Criterion. Some tests, for
example, the Strong Vocational Interest Blank, have been developed by
correlating each item with an outside criterion. The outside criterion may be
success in adjustment, vocational, marital, or personal. Any of the
correlation methods appropriate with items may be used. Weights for scoring may
be attached to responses by one of the accepted methods. The result is
likely to be a valid score for the particular purpose and within the particular
population on which the item validation was performed. Use of the score
for other purposes and with other populations has to be defended by new
empirical evidence of validity. It is probably important, also, to keep
accumulating evidence of validity within the area of the test's original
development.


1 Guilford, J. P. Printed classification tests. Chap. 25.

2 Fruchter, B. Differences in factor content of rights and wrongs scores. Psychometrika,
1953, 18, 257-265.

For methods of weighting responses to items, see Guilford, J. P. Psychometric Methods.
2d ed. New York: McGraw-Hill, 1954.

This procedure is describable as a kind of “shotgun” approach. It gets 
practical results without much knowledge of why there is validity. For 
example, the AAF developed a Biographical Data Blank composed of items 
of information about the student's previous life and experiences.¹ By correlating
every response to a large number of experimental items with the
graduation-elimination dichotomy in pilot training and also in navigator 
training, two scoring keys were derived, each valid for its own purpose. 

One could be content with these new, unique contributions to prediction of 
training success. On the other hand, one could well be curious as to the 
underlying reasons. Correlational studies and factor analyses revealed that 
the pilot score was valid chiefly because it indicated the effectiveness of the 
student’s background of experience in mechanical matters and because it 
revealed his interest or motivation for pilot training. To a much smaller 
extent it revealed the student's status in perceptual speed and in psychomotor 
coordination. These were represented in the pilot criterion also. The 
navigator score was valid, however, primarily because it revealed the stu- 
dent's background experience in mathematics and to a small extent his num- 
ber facility. Once the major sources of validity for each score are recognized, 
one is in a position to improve measurement of them. As a matter of fact, 
as in the example of the biographical-data approach, there often prove to be 
better measures of the significant factors, or better measures can be developed 
to replace the preliminary ones. 

It is to be recognized that in an unknown sphere of prediction much progress 
can be made by the “shotgun” approach, of correlating a large number of 
items with an outside or practical criterion. It is recommended, however, 
that we attempt to get past this stage as soon as possible, finding out the 
underlying reasons for successful prediction, and improving the measuring 
instruments needed. Where requirements are known in terms of factorial 
information, the development of univocal tests is called for, and this means 
item-test correlation rather than item-criterion correlation.

1 Guilford, J. P. (ed.) Printed classification tests. Chap. 27.


Exercises


Give your conclusions and interpretations in connection with each of the following
problems:

1. Two tests, $X_1$ and $X_2$, and a criterion J have loadings in factors A, B, C, and D, which
are uncorrelated with one another. The loadings and corresponding reliabilities are as
follows:


Compute: (a) communalities; (b) proportions of specific variances; and (c)
intercorrelations.
2. Test X has a reliability coefficient of .92, and criterion Y has a reliability of .65.
Assume that the validity coefficient in each of four uses has values of .35, .48, .61, and .72.
a. Determine the probable correlation between the “true” test scores and the “true”
criterion measures in each situation.
b. Determine the validity of the fallible test for predicting the “true” criterion in each
situation.
3. In the preceding problem, assume that $\sigma_y = 15.0$. Compute $\sigma_{\infty x}$ and $E_{\infty x}$ for the four
instances.
4. Four homogeneous tests have reliability coefficients and validity coefficients as
follows:


a. Estimate the validity coefficient in each case, assuming that each test is doubled in
length.
b. Do the same, assuming that each test is made five times as long.
c. Do the same, assuming that each test is made half as long.
5. How long (in ratio to original lengths) would it be necessary to make tests $X_1$ and $X_3$
in Exercise 4 in order to make the validity coefficient of each test .60?
6. Assume the following data for a certain test:


$\sigma_r = 10.0$    $\sigma_w = 4.0$    $r_{cr} = .3$    $r_{cw} = -.2$    $r_{rw} = -.4$


(where the subscripts stand for “right,” “wrong,” and “criterion” scores, respectively).

a. Compute the optimal weight for the wrong responses (W), when the right responses
(R) are weighted +1.
b. Compute the correlation of scores obtained by use of these weights with the
criterion measures (C).
c. Assume in turn arbitrary weights of −2.0 and +1.0 for the wrong responses (with
a weight of +1.0 for right responses) and estimate the correlation with C for such
weighted combinations.


Answers


1. $h^2$: .53, .87, .30; $s^2$: .27, .00, .35; $r_{12}$ = .40; $r_{1j}$ = .36; $r_{2j}$ = .24.
2. a. $r_{\infty\omega}$: .45; .62; .79; .93.
   b. $r_{\infty x}$: .43; .60; .76; .89.
3. $\sigma_{\infty x}$: 10.9; 9.7; 7.9; 5.4.
   $E_{\infty x}$: 9.9; 19.6; 34.6; 55.0.
4. a. .74; .53; .56; .32.
   b. .76; .55; .61; .33.
   c. .64; .46; .42; .27.
5. n: 0.15; 4.24.
6. (a) v = −0.53; (b) R = .31; (c) r: .30, .24.


CHAPTER 19


TEST SCALES AND NORMS 


In this chapter we consider in some detail the problem of measurement by 
means of test scores. In previous chapters where test scores played a role, it 
was usually assumed that they approximated scales with equal units; that 
equal increments of numbers correspond to equal increments of psychological 
quantity. Such an assumption is necessary for the meaningful application 
of most statistical operations. When a test is composed of many items and 
when it is of an appropriate level of difficulty for the population examined, 
this assumption is fairly sound. 

In the following pages we shall consider some ways of transforming raw- 
score scales into other scales for various reasons. One objective is to effect a 
more reasonable scale of measurement. Another important objective is to 
derive comparable scales for different tests. The raw scores from each test 
yield numbers that have no necessary comparability with numbers from 
another test. There are many occasions for wanting not only comparable 
values from different tests but also values that have some standard meaning. 
These are the problems of test norms and test standards. 

Why Common Scales Are Necessary. Aside from a few tests that yield 
scores in terms of physical-stimulus values (such as tests of sensory acuity) 
or of response values (such as time, distance, or energy values), most tests 
yield numerical values that have no particular significance. There was a 
time when scores were given in terms of percentages. The tradition of 
grading examinations in terms of percentage of right answers still has popular 
appeal, in spite of the many experimental demonstrations that such per- 
centages are neither accurate nor meaningful. The method gave a feeling 
(definitely fallacious) of having some kind of an “absolute” measure of the 
individual. It is difficult for even the better informed student to free himself
from this traditional thinking, even when he has given up the operations it 
implies. 

If modern psychology and education have taught anything about measurement,
they have amply demonstrated the fact that there are few, if any,
absolute measures of human behavior. The emphasis has shifted from a
search for absolute measures to an emphasis upon the concept of individual
differences. The mean of the population has become the reference point,
and out of the differences between individuals has come the basis for scale
units. Even when the test happens to yield such objective scores as those in 
time, or space, or energy units, it is sometimes doubted that such units, 
though unquestionably equal from a physical point of view, really represent 
equal psychological increments along scales of ability or talent. These con- 
siderations, among others, send us in search of more rational and meaningful 
scales of measurement for behavior events. 

In addition to the more theoretical demands just mentioned, there is the 
very practical consideration that scales for different tests should be com- 
parable. The most obvious need for comparable scales is seen in educational 
and vocational guidance, particularly when profiles of scores are utilized. A 
profile is intended to give a picture of an individual. We would hardly bother 
to prepare one for an individual if we did not expect to make very direct 
comparisons of the person's levels in different traits. The comparisons of 
trait positions for the same individual would be misleading, if not worthless, 
if there were not at least reasonable comparability of levels for different scores 
going under the same numerical value. 

No informed person would think of using raw scores as a basis of making 
direct comparisons among an individual's positions with respect to trait 
variables. Conversion of raw scores to values on some other common scale 
is essential. The use of centile-rank positions was mentioned in an earlier 
chapter (Chap. 6). Centile values are suitable to the extent that they do 
make possible comparable values for different tests, they do use the mean (or 
median) as the main reference point, and they are easily understood by the 
layman. They serve their best purpose when measurements must be inter- 
preted to the layman. But, for reasons which were stated earlier (Chap. 6), 
centile values have limitations which make them fall short of full usefulness 
to those who expect something more of measurements. Centiles, after all, 
are rank positions and do not represent equal units of individual differences.
It is possible to have scales that probably provide units of equal size as well as 
comparability of means, dispersions, and form of distribution. 

Some Common Derived Scales. The chief interest in what follows will be 
in such scales—those which achieve comparability of means, dispersions, and 
form of distribution. We shall not go into the very popular mental-age 
concept or the IQ scale. As simple as those ideas may be, the achievement of a 
battery of tests which will meet the requirements of age equivalents and 
appropriate distributions of IQ involves statistical problems of an intricate 
nature which we cannot go into. Treatment of these problems may be 
found in references to McNemar and to Marks.¹ The three kinds of scales 
to be discussed here are the standard-score scale, the T scale, and the C scale. 
Their application to derivation of test norms and profile charts will be given 
attention. The treatment will be kept at a rather elementary level, emphasizing 
basic concepts. For a more advanced treatment of some of these 
problems the reader is referred to a discussion prepared by Flanagan.² 

¹ McNemar, Q. The Revision of the Stanford-Binet Scale. Boston: Houghton Mifflin, 
1942; Marks, E. S. Sampling in the revision of the Stanford-Binet scale. Psychol. Bull., 
1947, 44, 413-434. 


STANDARD SCORES 


An Example of the Need for Comparable Scores. A concrete example will 
illustrate some of the ideas expressed above. A student earns scores of 195 
in an English examination, 20 in a reading test, 39 in an information test, 139 
in a general scholastic-aptitude test, and 41 in a nonverbal psychological test. 
Is he therefore best in English and poorest in reading? Could he perhaps be 
equally good in all the tests? From the raw scores alone, we can answer 
neither of these questions nor many others that could be legitimately asked. 
This student’s five scores just cited will be seen listed in column 4 of Table 
19.1 (student I). Knowing the means of students in the five tests helps some, 
since they serve as norms or comparable reference points. The means are 
listed in column 2. We now see that the student is well above average in 
English and in scholastic aptitude and is somewhat below average in reading 
and information, just as the numbers seem to indicate at their face value. 
The second student, whose raw scores are also in column 4, is numerically 
highest in the same two and lowest in the same three. When we consider 
the averages again, however, we find that student II is only about average in 
English, in scholastic aptitude, and in the psychological test, but he is above 
average in reading and in the information test. 


TABLE 19.1. A COMPARISON OF STANDARD SCORES WITH RAW SCORES EARNED BY 
TWO STUDENTS IN FIVE EXAMINATIONS 

        (1)            (2)       (3)          (4)              (5)               (6)
                               Standard    Raw scores       Deviations      Standard scores
    Examination        Mean   deviation     I     II        I       II        I       II

English               155.7     26.4       195    162     +39.3    +6.3     +1.49   +0.24
Reading                33.7      8.2        20     54     -13.7   +20.3     -1.67   +2.48
Information            54.5      9.3        39     72     -15.5   +17.5     -1.67   +1.88
Scholastic aptitude    87.1     25.8       139     84     +51.9    -3.1     +2.01   -0.12
Psychological          24.8      6.8        41     25     +16.2    +0.2     +2.38   +0.03

Sums                                       434    397     +78.2   +41.2     +2.54   +4.51

² Flanagan, J. C., in Lindquist, E. F. (ed.). Educational Measurement. Washington, 
D.C.: American Council on Education, 1951. Chap. 17. 

When a student is above the mean in two tests, in which one is he actually 
superior? Student I is 39.3 points above the mean in English and 16.2 points 
above the mean in the psychological test (see column 5 of Table 19.1). Is his 
superiority in English really greater than his superiority in the psychological 
test? Student II is 20.3 points above the mean in reading and 17.5 points 
above the mean in information. Is he about equally superior in the two 
tests? 

And how do the two students compare? The superiority of student I is 
apparent in three tests (English, scholastic aptitude, and psychological) and 
that of student II, in the other two tests. This we can tell from the raw 
scores. But suppose the two were competing for a scholarship at a university; 
which one, if there is to be a choice between the two, should win? 
The totals of the five scores are 434 and 397, in favor of student I. Granting 
that the five different abilities are equally important, have we done justice by 
comparing sums of raw scores? Are we justified in finding an average of each 
student’s five raw scores? 

Suppose that we were interested in determining which student is the more 
consistent in his abilities, as shown by these five tests, and which one has the 
greater variability within himself. Would a comparison of the average 
deviations or standard deviations of the five raw scores give us the answer? 
As the reader has probably guessed, the reply to most of these questions is in 
the negative. We are extremely limited in making direct comparisons in 
terms of raw scores for the reason that raw-score scales are arbitrary and 
unique. We need a common scale before such comparisons as we have 
called for can be made. Standard scores furnish one such common scale. 

The Nature of a Standard-score Scale. A standard-score scale is one that 
has a mean of zero and a standard deviation of 1.0. The unit of the scale 
might be taken as 1σ, or as 0.1σ, or any other arbitrary fraction of the 
standard deviation. An illustration of the conversion of a raw-score scale into a 
standard scale is shown in Fig. 19.1, A, B, and C. Distribution A is based 
upon the original, or raw, scores. The mean is 80 and the standard deviation is 
14.0. The distribution is obviously somewhat negatively skewed. 

As we have previously seen, a standard score z is derived from a raw score 
X by means of the formula 

$$z = \frac{X - M}{\sigma} = \frac{x}{\sigma} \qquad \text{(standard score corresponding to a raw score } X \text{ and to a deviation } x\text{)} \tag{19.1}$$


An intermediate step between the raw-score scale and the standard-score 
scale is the deviation X − M, or x. This step is illustrated in Fig. 19.1 B. 
Deducting the mean from every raw score has the effect of shifting the entire 
distribution down the same scale so that the mean is zero. The final step, 
arriving at the z scale, is shown in Fig. 19.1 C. Distribution C is drawn so 
that the mean is directly beneath that in distribution B, both at zero, and so 
that deviations of 14 units on the original scale correspond with deviations of 
1σ on the standard scale. Especially to be noted is the fact that the form of 
distribution has not changed; it is still skewed exactly as it was originally. 
This procedure does not normalize the distribution as some other scaling 
procedures do. 

Application to Comparisons of Scores. The two students represented in 
Table 19.1 will now be compared in terms of their standard scores. Before 
we take these comparisons very seriously, however, we must consider two 
possible limitations to this procedure. Applying formula (19.1), we arrive at 
the standard scores in column 6 of Table 19.1. For accurate comparisons 
between different tests, there are two necessary conditions to be satisfied. 
The population of students from which the distributions of scores arose must 
be assumed to have equal means and dispersions in all the abilities measured 
by the different tests and the form of distribution, in terms of skewness and 
kurtosis, must be very similar from one ability to another. 

Fig. 19.1. Distributions before and after conversion from a raw-score scale to a 
standardized-score scale with a desired mean and standard deviation, with and without 
normalizing the distribution. The panels show the original score scale (X), mean = 80, 
σ = 14.0; the deviation-score scale (x), mean = 0, σ = 14.0; the standard-score scale (z), 
mean = 0, σ = 1.0; a standard scale (not normalized), mean = 50, σ = 10.0; and the 
T-score scale (normalized), mean = 50, σ = 10.0. 

Unfortunately, we have no ideal scales common to all these tests, measure- 
ments which would tell us about these population parameters. Certain 
selective features might have brought about a higher mean, a narrower dis- 
persion, and a negatively skewed distribution on the actual continuum of 
ability measured by one test, and a lower mean, a wider dispersion, and a 
symmetrical distribution on the continuum of another ability represented by 
another test. Since we can never know definitely about these features for 
any given population, if we want to achieve communality of scales at all 
(standard or any other), we often have to proceed on the assumption that 
actual means, standard deviations, and form of distribution are uniform for 
all abilities measured. In spite of these limitations, it is almost certain that 
derived scales, such as the standard-score scale, provide us with more nearly 
comparable values than do raw-score scales. The recognition of these 
limitations, however, should be admitted and interpretations based upon the 
use of standard scores should be made with appropriate reservations in line 
with those limitations. 

Returning to Table 19.1, with the standard scores we have for the two 
students, we can now give more satisfactory answers to the questions raised 
above about these students. Student I is most superior in the psychological 
test, next in scholastic aptitude, and third in English. Had we judged this 
by his deviations from the mean, we should have decided that his order of 
superiority was scholastic aptitude first, English second, and psychological 
third. We find that in terms of standard scores he is equally deficient in 
reading ability and information, whereas the deviations would have placed 
him lower in information than in reading. Student II's five standard scores 
come in about the same rank order as do his deviation scores but certainly 
not in the same order as his raw scores. 

When comparing the two students in terms of raw scores, we should conclude 
that student I has the greatest advantage in number of points in 
scholastic aptitude; in terms of deviations, this would be the same, but in 
terms of standard scores it is in the psychological test that the advantage is 
greatest. Student II has about the same superiority over student I in the 
reading and information tests in terms of raw scores and deviations but has 
decidedly greater superiority in reading ability in terms of standard scores. 
When we compare the two students as to total or average score, whereas the 
raw-score total gives student I the distinct advantage of 37 points, or an 
average superiority of about 7 points, the standard-score averages reverse the 
order and give student II a 0.39σ lead. In a scholarship contest, we should 
conclude that student II has the greater all-round ability as indicated by 
these tests, when students are compared on a standard-score basis. 
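
The comparison just made can be retraced numerically. The following is only a sketch of formula (19.1) applied to Table 19.1, not a prescribed routine. It assumes student II's raw scores (162, 54, 72, 84, 25), which are reconstructed here from the deviations and standard scores quoted in the text and which agree with the stated total of 397. The names `TABLE_19_1` and `standard_score` are illustrative.

```python
# Columns 2-4 of Table 19.1: mean, standard deviation, raw score of
# student I, raw score of student II.
# ASSUMPTION: student II's raw scores are reconstructed from the deviations
# and standard scores quoted in the text; they add up to the stated 397.
TABLE_19_1 = {
    "English": (155.7, 26.4, 195, 162),
    "Reading": (33.7, 8.2, 20, 54),
    "Information": (54.5, 9.3, 39, 72),
    "Scholastic aptitude": (87.1, 25.8, 139, 84),
    "Psychological": (24.8, 6.8, 41, 25),
}


def standard_score(raw, mean, sd):
    """Formula (19.1): z = (X - M) / sigma."""
    return (raw - mean) / sd


z_student_1 = {
    test: standard_score(x1, m, sd) for test, (m, sd, x1, _) in TABLE_19_1.items()
}
z_student_2 = {
    test: standard_score(x2, m, sd) for test, (m, sd, _, x2) in TABLE_19_1.items()
}

# Totals of raw scores and averages of standard scores for the two students.
raw_totals = (
    sum(row[2] for row in TABLE_19_1.values()),
    sum(row[3] for row in TABLE_19_1.values()),
)
z_means = (
    sum(z_student_1.values()) / len(z_student_1),
    sum(z_student_2.values()) / len(z_student_2),
)

for test in TABLE_19_1:
    print(f"{test:20s} {z_student_1[test]:+.2f} {z_student_2[test]:+.2f}")
print("raw totals:", raw_totals)
print(f"mean z: {z_means[0]:+.2f} {z_means[1]:+.2f}")
```

The raw totals favor student I, while the averages of the standard scores favor student II, which is the reversal described above.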

Disadvantages of Standard Scores. Although standard scores will do for 
us all that we have said and more, under the proper conditions, there are 
several things about them which make them less convenient than some others. 
One shortcoming is the fact that half the scores will be negative in sign, which 
makes things awkward in computation. Another disadvantage is the very 
large unit, which is one standard deviation. 

We could, of course, overcome the first shortcoming by adding a constant 
to all the scores to make them all positive, and we could multiply them by 
another constant, preferably by 10, to make the unit smaller and the range in 
total units greater. If we did both of these, we could achieve almost any 
mean and standard deviation we wanted, depending upon the choice of 
constants. If we wanted a mean of 50 and a standard deviation of 10, we 
would multiply every standard score by 10 and add 50. 

Direct Scaling to a Desired Mean and Standard Deviation. This brings 
us to a more general procedure. If we knew from the time that we had 
acquired the distribution of raw scores that we were to convert them to a 
common scale with a certain mean and standard deviation, we should not 
go to the trouble of converting first to standard scores, then to the new scale. 
We can do the operation in one step by the equation³ 

$$X_s = \left(\frac{\sigma_s}{\sigma_o}\right) X_o - \left[\left(\frac{\sigma_s}{\sigma_o}\right) M_o - M_s\right] \tag{19.2}$$

where $X_s$ = a score on the standard scale, corresponding to $X_o$ 
      $X_o$ = a score on the obtained scale; raw score 
      $M_o$ and $M_s$ = means of $X_o$ and $X_s$, respectively 
      $\sigma_o$ and $\sigma_s$ = standard deviations of $X_o$ and $X_s$, respectively 

If the desired mean is 50 and the desired standard deviation is 10, with these 
substitutions the equation becomes 

$$X_s = \left(\frac{10}{\sigma_o}\right) X_o - \left[\left(\frac{10}{\sigma_o}\right) M_o - 50\right]$$

Knowing $\sigma_o$ and $M_o$ from the particular distribution of raw scores, the 
equation reduces to very simple form describing a straight line. Taking the 
illustration of Fig. 19.1, where $M_o = 80$ and $\sigma_o = 14.0$, 

$$X_s = \left(\frac{10}{14}\right) X_o - \left[\left(\frac{10}{14}\right) 80 - 50\right] \approx 0.714 X_o - \left[0.714(80) - 50\right] = 0.714 X_o - 7.12$$

A raw score of 100 would, by this formula, become a scaled score of 64. A 
raw score of 50 would become a scaled score of 29. We can see a graphic 
exhibition of this transformation by relating distributions A and D in Fig. 
19.1. A score of 100 in A is in a position comparable to a score of 64 in D, 
and a score of 50 in A is in a position similar to 29 in D. 
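
As a check on the arithmetic, formula (19.2) can be written as a small function. This is a minimal sketch; the function name `scale_score` and its parameter names are illustrative and not part of the text.

```python
def scale_score(raw, mean_obtained, sd_obtained, mean_desired=50.0, sd_desired=10.0):
    """Formula (19.2): X_s = (s_s / s_o) X_o - [(s_s / s_o) M_o - M_s]."""
    slope = sd_desired / sd_obtained
    constant = slope * mean_obtained - mean_desired
    return slope * raw - constant


# The illustration of Fig. 19.1: obtained mean 80, standard deviation 14.0.
for raw in (100, 80, 50):
    print(raw, round(scale_score(raw, 80, 14.0), 1))
```

Because the conversion is a straight line, the obtained mean of 80 lands on the desired mean of 50, and the form of the distribution is left as it was.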

Scaling by this procedure, as by the standard-score method, assumes that 
the obtained form of distribution is the same as the population distribution. 
If this is true, then it is probable that units on the derived scale are equal, 
as are those on the raw-score scale. So far as improving the equality of units 
is concerned, then, nothing has been gained, nor was anything to be gained. 
We know, however, that the form of distribution of a sample is not necessarily 
the form of distribution of the population. The discrepancy need not be, 
and probably is not, due to sampling errors, particularly if the sample is large. 
There are many reasons for radical departures of sample distributions from 
genuine population distributions of the trait measured: difficulty level of the 
test, intercorrelation of the items (see Chap. 17), and the variations in 
difficulty and intercorrelation. We should not, therefore, feel too obligated to 
retain the same form of distribution in scaled scores as in the raw scores. If 
there is a real discrepancy between population distribution and sample 
distribution, there is much room for improvement of the scale in terms of 
equality of units. The next methods to be described have the probable 
advantage that by normalizing distributions they also achieve better metric 
scales. 

³ For the derivation of this type of equation, see Appendix A. 


THE T SCALE AND T SCALING OF TESTS 


The well-known T scale overcomes the objections raised against standard 
scores and adds besides an advantage peculiar to itself. It adopts as its unit 
one-tenth of a standard deviation, so that an ordinary distribution with a 
range of 5 to 6σ on its base line yields 50 to 60 integral T-scale scores. In 
addition, the T scale goes beyond any ordinary distribution, extending over 
a spread of 10 standard deviations, or 100 units in all. 

Fig. 19.2. The T scale and its relation to the standard-score scale extending over a range of 
10σ. Standard scores from −5σ to +5σ are aligned with T-scale scores from 0 to 100. 

Any age or grade group would yield its own distribution extending 5 to 6σ. 
A group just higher in ability would overlap this one and yet would need an 
extension over new units beyond the limit of the first group. A third group 
of lower age would need an extension of the measuring stick at the other end. 
When all groups from lowest to highest are taken into account, considerable 
extension is required. The result, with these extensions, is a single common 
scale on which all groups, over a wide range, have a common unit and a 
common zero point. It has been found in practice that a scale with 100 units 
(or 10σ) will be extensive enough. It is based upon a normal curve whose 
tails extend from −5σ to +5σ (see Fig. 19.2). Besides making the unit equal 
to 0.1σ, the T scale also has the zero point at the extreme left, which places it 
at −5σ. The mean now becomes 50, and the other T-scale points are spaced 
as in Fig. 19.2. 

How to Derive T-scale Equivalents for Raw Scores. A college or university 
or a single school system may wish to use the T-scale idea as its common 
yardstick for all its tests. The freshmen entering a large university, for 
example, may be taken as the standard group for this purpose. As an 
illustration, let us use the data in Table 19.2. Here is a distribution of 83 scores 
obtained by freshmen in an English examination of the objectively scored 
type. 


TABLE 19.2. THE CALCULATION OF T SCORES FOR A DISTRIBUTION OF 
ENGLISH-EXAMINATION SCORES 

   (1)          (2)           (3)          (4)           (5)            (6)
  Scores    Upper limit    Frequency    Cumulative    Cumulative    T score (from
            of interval                 frequency     proportion     Table 19.3)

 225-229      229.5            1           83           1.000            -
 220-224      224.5            0           82            .988          72.6
 215-219      219.5            1           82            .988          72.6
 210-214      214.5            5           81            .976          69.8
 205-209      209.5            5           76            .916          63.8
 200-204      204.5            7           71            .855          60.6
 195-199      199.5            6           64            .771          57.4
 190-194      194.5            6           58            .700          55.2
 185-189      189.5            6           52            .627          53.2
 180-184      184.5           11           46            .554          51.4
 175-179      179.5            9           35            .422          48.0
 170-174      174.5            5           26            .313          45.1
 165-169      169.5            5           21            .253          43.3
 160-164      164.5            6           16            .193          41.3
 155-159      159.5            5           10            .120          38.2
 150-154      154.5            2            5            .060          34.5
 145-149      149.5            1            3            .036          32.0
 140-144      144.5            1            2            .024          30.2
 135-139      139.5            0            1            .012          27.4
 130-134      134.5            1            1            .012          27.4


The procedure will be described step by step: 


Step 1. List the class intervals as usual. Here a large number of class inter- 
vals is desirable. 

Step 2. List the exact upper limits of class intervals. 

Step 3. List the frequencies. 

Step 4. List the cumulative frequencies (see Chap. 6 for instructions). 

Step 5. Find the cumulative proportions for the class intervals. 

Step 6. Find the corresponding T scores from Table 19.3. These are then 
listed in the last column of Table 19.2, given to one decimal place. 
We usually want finally a ready means of reading directly the T score 
corresponding to any integral raw score. It is recommended that the 
remaining steps be taken to satisfy this objective. 


TABLE 19.3. A TABLE TO AID IN THE CALCULATION OF T SCORES 

 Proportion below   T score     Proportion below   T score     Proportion below   T score
    the point                      the point                      the point

      .0005          17.1            .100           37.2            .900           62.8
      .0007          18.1            .120           38.3            .910           63.4
      .0010          19.1            .140           39.2            .920           64.1
      .0015          20.3            .160           40.1            .930           64.8
      .0020          21.2            .180           40.8            .940           65.5
      .0025          21.9            .200           41.6            .950           66.4
      .0030          22.5            .220           42.3            .960           67.5
      .0040          23.5            .250           43.3            .965           68.1
      .0050          24.2            .300           44.8            .970           68.8
      .0070          25.4            .350           46.1            .975           69.6
      .010           26.7            .400           47.5            .980           70.5
      .015           28.3            .450           48.7            .985           71.7
      .020           29.5            .500           50.0            .990           73.3
      .025           30.4            .550           51.3            .993           74.6
      .030           31.2            .600           52.5            .995           75.8
      .035           31.9            .650           53.9            .9960          76.5
      .040           32.5            .700           55.2            .9970          77.5
      .050           33.6            .750           56.7            .9975          78.1
      .060           34.5            .780           57.7            .9980          78.7
      .070           35.2            .800           58.4            .9985          79.7
      .080           35.9            .820           59.2            .9990          80.9
      .090           36.6            .840           59.9            .9993          81.9
                                     .860           60.8            .9995          82.9
                                     .880           61.7

Step 7. Plot a series of points to represent each T score in Table 19.2 
corresponding to the upper limit of the class interval, as in Fig. 19.3. If 
the original distribution of raw scores is normal, the points should fall 
rather close to a straight line. The reason that they are not perfectly 
in line is that there are some irregularities in the original data. Draw 
through the points with a straightedge a line that will come as close 
to all the points as seems possible. Among those that do not touch 
the line, as many of them should be above it as below it. The line 
may be extended beyond the ends of the points at both ends. If the 
raw-score distribution is skewed, the trend in the points when plotted 
will show some curvature. It is best, then, to attempt to follow the 
curvature but with a smooth trend. If the curvature is not followed, 
the distribution of the population on the scaled scores will not be 
normalized. 

Fig. 19.3. A smoothing process applied in deriving T-scale equivalents for 
English-examination scores (see Table 19.2). The T-score scale is plotted against the 
raw-score scale. 


Step 8. For any integral raw-score point, we can now find the corresponding 
T-score points. For example, in Fig. 19.3, a raw score of 220 corresponds 
to a T score of 70, and a raw score of 150 corresponds to a T 
score of 33. In this we favor integral T scores but at times have to 
resort to half points when we cannot decide upon the nearest unit. 

Step 9. Prepare a table in which every integral raw score, or every second, 
third, or fifth one, appears in one column and the corresponding T 
scores in the other. Table 19.4 is such a tabulation. It will serve 
for all future purposes of translation where the original tested group 
remains the standard. Many test users prefer to list every raw score 
and its T-score equivalent so as to avoid the need for interpolation. 


TABLE 19.4. RECTIFIED SCALING WITH T SCORES FOR THE DISTRIBUTION OF 
ENGLISH-EXAMINATION SCORES 

 Examination score    T score      Examination score    T score

        240                               155             35.5
        235                               150             33
        230                               145             30
        225                               140             27.5
        220                               135             25
        215                               130             22
        210                               125             20.5
        205                               120             17
        200
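
Steps 4 to 6 can also be retraced by computation. The sketch below is an illustration under stated assumptions, not the tabular method itself: it takes the T score directly from the inverse of the normal curve (`statistics.NormalDist.inv_cdf` in the Python standard library) instead of looking it up in Table 19.3, and it stops before the graphic smoothing of steps 7 to 9. The names `UPPER_LIMITS`, `FREQUENCIES`, and `t_scale_rows` are illustrative.

```python
from statistics import NormalDist

# Columns 2 and 3 of Table 19.2, listed from the lowest interval upward.
UPPER_LIMITS = [134.5 + 5 * k for k in range(20)]  # 134.5, 139.5, ..., 229.5
FREQUENCIES = [1, 0, 1, 1, 2, 5, 6, 5, 5, 9, 11, 6, 6, 6, 7, 5, 5, 1, 0, 1]


def t_scale_rows(upper_limits, frequencies):
    """Steps 4-6: cumulative frequency, cumulative proportion, T score."""
    n = sum(frequencies)
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    rows, cumulative = [], 0
    for limit, frequency in zip(upper_limits, frequencies):
        cumulative += frequency
        proportion = cumulative / n
        # The normal curve never reaches a proportion of 0 or 1, so no
        # T score is assigned there (the top row of Table 19.2).
        if 0.0 < proportion < 1.0:
            t_score = 50 + 10 * unit_normal.inv_cdf(proportion)
        else:
            t_score = None
        rows.append((limit, cumulative, proportion, t_score))
    return rows


rows = t_scale_rows(UPPER_LIMITS, FREQUENCIES)
t_by_limit = {limit: t_score for limit, _, _, t_score in rows}

for limit, cumulative, proportion, t_score in reversed(rows):
    shown = "   - " if t_score is None else f"{t_score:5.1f}"
    print(f"{limit:6.1f} {cumulative:3d} {proportion:.3f} {shown}")
```

The values obtained this way may differ from column 6 of Table 19.2 in the last decimal place, since the table works from proportions rounded to three places.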


A Normal Graphic Procedure for T Scaling. It is possible to do more of 
the T scaling graphically by the use of normal-probability paper. This graph 
paper is especially designed with spacing for cumulative proportions along 
one axis in a manner consistent with the cumulative normal-curve function. 
Figure 19.4 shows how the English-examination data can be so treated. 
Using the cumulative proportions appearing in Table 19.2, column 5, we plot 
each one against its corresponding raw-score value given in column 2. The 
trend of the points will be in a straight line if the distribution of raw scores is 
normal. If that distribution is skewed there will be some curvature in the 
trend which one should try to follow in smoothing. To find the T equivalent 
for any raw score, we find that raw score on the base line, follow it up to the 
line drawn through the points, locate the equivalent proportion, then go to 
Table 19.3 for the corresponding T. 

Fig. 19.4. A graphic solution to scaling, which utilizes normal-probability graph paper. 
Cumulative proportions are plotted against English-examination scores. 

An Evaluation of the T-scale Procedure. The T scale is probably the most 
widely used of all derived scales. Its advantages are many, its disadvantages 
few. When the scaling is carried out, as described, the procedure normalizes 
distributions. This effect is pictured in Fig. 19.1. Contrast distributions 
D and E in that illustration. Both have a mean of 50 and a σ of 10. The 
one is skewed like the original distribution, the other is normal. The 
normalizing process comes about through the conversion to centiles and then to 
corresponding deviations from the mean in a normal distribution. Table 
19.3 is based upon the normal curve. For a given proportion (area below 
a given point) it gives a T-score equivalent instead of a standard-score 
equivalent. 

The normalizing process may be pictured as in Fig. 19.5. There the 
obtained distribution, seriously skewed, is given below, and the normalized 
distribution on the derived scale above. The process ensures that the areas 
A, B, C, . . . , M correspond in the proportions that they occupy with areas 
A', B', C', . . . , M'. The correspondences of scale distances are also shown, 
by connecting dotted lines. If the units on the derived scale (not shown) 
represent genuinely equal increments of the measured variable, then 
obviously those on the original scale do not. We may not know that the 
population is normally distributed on a trait, but by normalizing distributions, 
where there is no inhibiting information to the contrary, we achieve more 
common and meaningful scores. 

Fig. 19.5. A graphic illustration of what happens in scaling so as to normalize a distribution. 
Intervals are matched so as to equate corresponding areas under the curves. 

Other advantages of the T scale have been mentioned: the possibility of 
extending it beyond limited populations, its convenient mean, unit, and 
standard deviation, and its general applicability. It has some limitations 
which should be pointed out. In much practical use of tests, as fine a unit 
as 0.1σ may be an overrefinement. Much coarser discriminations are all that 
may be necessary. Furthermore, the unit may give quite a false sense of 
accuracy of the measurement that is actually being made. If the original 
scores had a standard deviation much smaller than 10 (for example, one of 
five score units), then the substitution of a unit of 0.1σ is in a sense 
“hair-splitting.” Two whole units on the T scale are then as fine a distinction as 
we could actually make between individuals. 

Nor is this the whole story. Every test, even the best of them, has an 
error of measurement whose size is indicated by its “standard error of 
measurement” (see Chap. 17). This stems from the fact that the test is not 
perfectly reliable. If the error of measurement is as much as two units on the 
raw-score scale, it might be even larger on the T scale. If the error is such 
that the best practical discriminations we can make between individuals are of 
the order of one-half σ, it is rather presumptuous to apply a scale that pretends 
to distinguish to one-tenth σ. For this reason, particularly, and because 
many test users require less refinement than the T scale offers, the writer has 
proposed the C scale, which will be described next. 


THE C SCALE AND C SCALING 


The C-scale System. The principles of the C scale and the derivation of 
C-scale equivalents for raw scores are illustrated in Table 19.5. The C scale 
is so arranged that the mean will be exactly at 5.0, with the two limiting 
classes being 0 and 10. Column 2 gives the exact limits of the 11 units in 
terms of standard scores. The corresponding centile limits (derived from 
Table B) are given in column 3. The percentage of cases within each unit 
is found by subtracting neighboring pairs of centile limits. Thus, in the 
middle unit, the difference 59.9 − 40.1 = 19.8, etc. Since it is more 
convenient to think in terms of whole numbers, the approximate percentages of 
the cases falling in the different classes are given as nearest whole numbers in 
column 5. These can be used either as a guide in thinking of the make-up 
of the standard distribution or even in subdividing lists of scores of 
individuals when arranged in rank order. Thus, if we had 100 persons lined up 
in rank order in a test, the highest person would be given the score of 10, the 
next three a score of 9, the next seven a score of 8, etc., until the last in line is 
given a score of 0. 


TABLE 19.5. THE ELEVEN-POINT SCALED-SCORE SYSTEM AND ITS APPLICATION TO 
THE MEMORY-TEST DATA 

   (1)        (2)           (3)           (4)           (5)            (6)              (7)
                                       Percentage    Percentage    Corresponding    Memory-test
 C-scale   Standard-     Centile-        within       in whole     score points    scores in each
  score   score limits  rank limits       each        numbers      in the memory    scaled-score
                                        interval                       test           interval

   10                                     0.9            1                            41+
             +2.25         98.8
    9                                     2.8            3                            38-40
             +1.75         96.0
    8                                     6.6            7                            35-37
             +1.25         89.4
    7                                    12.1           12                            31-34
             +0.75         77.3
    6                                    17.4           17                            28-30
             +0.25         59.9
    5                                    19.8           20                            25-27
             -0.25         40.1
    4                                    17.4           17                            21-24
             -0.75         22.7                                        20.8
    3                                    12.1           12                            18-20
             -1.25         10.6
    2                                     6.6            7                            15-17
             -1.75          4.0
    1                                     2.8            3                            12-14
             -2.25          1.2
    0                                     0.9            1                            0-11
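
The percentages of the C-scale system follow from areas under the normal curve, and a short computation can retrace them. This is a sketch under one assumption that should be stated plainly: the outer limits of the two extreme classes are taken at ±2.75σ, which is inferred here from the 0.9 per cent assigned to scores 0 and 10 and is not stated in the text. The function name `c_scale_percentages` is illustrative.

```python
from statistics import NormalDist


def c_scale_percentages(outer_limit=2.75):
    """Percentage of a normal distribution falling in each C-scale unit.

    Units are half a standard deviation wide and centered on the mean, so
    the division points run -2.25, -1.75, ..., +2.25 in standard scores.
    ASSUMPTION: the extreme classes are closed off at +/- outer_limit.
    """
    unit_normal = NormalDist()
    limits = [-outer_limit + 0.5 * k for k in range(12)]
    centiles = [100 * unit_normal.cdf(z) for z in limits]
    return {c: centiles[c + 1] - centiles[c] for c in range(11)}


percentages = c_scale_percentages()
whole_numbers = [round(percentages[c]) for c in range(11)]

for c in range(10, -1, -1):
    print(f"{c:2d} {percentages[c]:5.1f} {whole_numbers[c]:3d}")
```

Exact areas differ slightly from differences of centile limits already rounded to one decimal (19.7 against 19.8 in the middle unit), but the whole-number percentages agree and add to 100.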

Steps in Deriving a C Scale. The operations for deriving a C scale are 
much the same as those for deriving a T scale. There are some differences in 
the steps to be recommended, however, and so all the steps will be listed here. 


Step 1. List the class intervals. 

Step 2. List the exact upper limits of the intervals. 

Step 3. List the frequencies. 

Step 4. List the cumulative frequencies. 

Step 5. Find the cumulative proportions for the intervals. 

Step 6. From here on the steps differ from those for T scaling. Next, plot 
the cumulative proportions on the ordinate corresponding to X values 
(exact upper limits) on the abscissa of coordinate paper. (See 
Chap. 6 for further instructions.) 

Step 7. Draw by inspection a smooth S-shaped curve through the trend of 
the points. If the distribution is obviously skewed and one tail of 
the S is short, or even if it vanishes, follow the points anyway. At 
this stage one sees the advantage of having a liberal number of classes. 

Step 8. Look for each of the centile limits (from column 3 of Table 19.5) on 
the ordinate, find the intersection of that centile-rank level with the 
curve, drop down to the abscissa to locate the corresponding raw- 
score point. Try to avoid arriving at a point exactly at integers, so 
that it is clear whether each integral raw score goes above or below 
the division point. The values obtained from this step are like those 
in column 6 of Table 19.5. 

Step 9. Determine within which C intervals the various integral score values 
lie and write the limiting scores as in column 7 of Table 19.5. 
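
The nine steps can be imitated by computation, with two substitutions that should be kept in mind. The smooth S-shaped curve drawn by inspection in step 7 is replaced here by straight-line interpolation within class intervals, and, since the memory-test distribution is not reproduced in this chapter, the English-examination data of Table 19.2 serve as the illustration. The names `division_points` and `c_score` are illustrative.

```python
from statistics import NormalDist

# English-examination data of Table 19.2, lowest interval first.
UPPER_LIMITS = [134.5 + 5 * k for k in range(20)]  # 134.5, 139.5, ..., 229.5
FREQUENCIES = [1, 0, 1, 1, 2, 5, 6, 5, 5, 9, 11, 6, 6, 6, 7, 5, 5, 1, 0, 1]
INTERVAL_WIDTH = 5

# Centile-rank limits at standard scores -2.25, -1.75, ..., +2.25.
CENTILE_LIMITS = [100 * NormalDist().cdf(-2.25 + 0.5 * k) for k in range(10)]


def division_points(upper_limits, frequencies, centile_limits, width):
    """Step 8: raw-score points at which the centile limits fall.

    ASSUMPTION: linear interpolation inside each class interval stands in
    for the curve smoothed by inspection.
    """
    n = sum(frequencies)
    cumulative_pct, running = [], 0
    for frequency in frequencies:
        running += frequency
        cumulative_pct.append(100 * running / n)

    points = []
    for centile in centile_limits:
        # First interval whose cumulative percentage reaches the centile.
        i = next(k for k, pct in enumerate(cumulative_pct) if pct >= centile)
        below = cumulative_pct[i - 1] if i > 0 else 0.0
        fraction = (centile - below) / (cumulative_pct[i] - below)
        points.append(upper_limits[i] - width + fraction * width)
    return points


def c_score(raw_score, points):
    """Step 9: the C score is the number of division points lying below."""
    return sum(raw_score > point for point in points)


points = division_points(UPPER_LIMITS, FREQUENCIES, CENTILE_LIMITS, INTERVAL_WIDTH)
print([round(p, 1) for p in points])
print({raw: c_score(raw, points) for raw in (130, 160, 182, 200, 229)})
```

With only 83 cases the division points in the tails rest on one or two scores, which is one reason the graphic smoothing of step 7 is recommended.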
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Alternative Graphic C-scaling Steps. Tf one already has a figure drawn like 
Fig. 19.3 that is used in T scaling, one could use it to accomplish steps 6 and 7 
in the following manner. Тһе о for the T scale is 10 and that for the C scale 
is2. The means are 50 and 5, respectively. An interval of one unit on the 
C scale corresponds to five units on the T scale. А C score of 5, therefore, 
occupies a range from 47.5 to 52.5; a C score of 6 corresponds to a range 57.5 
to 62.5, and soon. All the T-score limits of the C intervals can be seen repre- 
sented in Table 19.6. The T-score limits, therefore, can be located in Fig. 


Taste 19.6. T Scores EQUIVALENT TO C-scORE INTERVALS 


C score | T-score limits | Middle 7 score 


10 72.5-77.5 75 
9 67.5-72.5 70 
8 62.5-67.5 65 
7 57.5-62.5 60 
6 52.5 57.5 55 
5 47.5-52.5 50 
4 42.5-47.5 45 
3 37.5-42.5 40 
2 32.5-37.5 35 
1 27.5-32.5 30 
0 22.5-27.5 25 


19.3 and from them the corresponding points of division on the raw-score 
scale. These mark off the raw-score ranges corresponding to all C scores. 
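
The correspondence in Table 19.6 amounts to one line of arithmetic, sketched below. Two choices here are assumptions rather than anything the table states: T scores beyond the tabled limits are placed in the end classes 0 and 10, and a T score falling exactly on a division point is placed in the higher class. The function name `c_from_t` is illustrative.

```python
import math


def c_from_t(t_score):
    """C score whose T-score interval (Table 19.6) contains t_score.

    Each C unit spans five T units, and the lowest class begins at 22.5.
    ASSUMPTION: values outside 22.5-77.5 are assigned to classes 0 and 10.
    """
    c = math.floor((t_score - 22.5) / 5)
    return max(0, min(10, c))


for t in (25, 50, 55, 60, 75):
    print(t, c_from_t(t))
```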

The normal-graphic procedure described in connection with T scaling can 
also be applied here; in fact, it is even more convenient in this connection and 
is to be recommended in preference to steps 6 and 7. Since the centile ranks 
are marked on probability paper (see Fig. 19.4), one would locate the centile- 
rank limits (column 3 of Table 19.5) and from the plot, usually a straight line, 
find the corresponding raw-score division points. 

An Evaluation of the C Scale. The C scale has many of the advantages of 
the T scale. It refers obtained scores to a common scale that is related to the 
normal distribution. If the population distribution on a measured trait is 
normal, then the distribution of C scores properly represents that population 
and the units of measurement may be regarded as equal. It lacks the 
refinement of a small unit such as that provided by the T scale. On the other hand, 
it probably more nearly represents the accuracy of discrimination actually 
made by means of tests, and its broader categories will do for guidance 
purposes. 

There is a handicap in selection of personnel in that a change of minimum 
qualifying score of only one C-scale unit may result in quite a difference in 
percentage of cases selected. For example, if the cutoff score were changed 
from 5 to 6, 20 per cent more rejections would have to be made. For selection 
purposes, however, raw-score cutoffs may be just as feasible as derived scores. 
The reference of any chosen raw-score cutoff to equivalent C-score limits or 
centiles would add meaning to that particular value. 

For guidance and counseling purposes, the use of a zero C score may be 
unwise. Unless he is more sophisticated than most people, a counselee would 
hardly relish being told that he earned a score of zero. To meet this 
contingency, one could let the scores range from 1 through 11 instead of 0 through 
10. Or one could resort to a condensed scale to be described next. 

The Stanine Scale. There are several reasons for condensing the C scale 
to some extent by giving it a nine-unit range. This is usually done by 
combining the two categories at either end, with 4 per cent of the distribution in 
categories 1 and 9. Such a scale was standard for the Army Air Force 
Aviation Psychology Program during World War II. All test scores and 
composites were eventually scaled to this system, called "stanine" as a 
contraction of “standard nine." The mean of such a norm distribution 
would be 5.0, as in the C scale, but the standard deviation would be slightly 
lower—1.96—because of the contractions at the tails of the curve. 
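
The stated mean and standard deviation can be checked numerically. The sketch below assumes the usual C-scale category percentages (1, 3, 7, 12, 17, 20, 17, 12, 7, 3, 1 for C scores 0 through 10); only the 4 per cent in the end categories and the 20 per cent in the middle category are given explicitly in this section, so the remaining figures are an assumption. The helper name `mean_and_sd` is illustrative.

```python
import math

# Percentage of cases in each C-score category 0..10 (assumed standard values).
c_percent = [1, 3, 7, 12, 17, 20, 17, 12, 7, 3, 1]

# Combine the two categories at either end to obtain nine stanine categories.
stanine_percent = (
    [c_percent[0] + c_percent[1]] + c_percent[2:9] + [c_percent[9] + c_percent[10]]
)


def mean_and_sd(values, percents):
    """Mean and standard deviation of a grouped distribution."""
    total = sum(percents)
    mean = sum(v * p for v, p in zip(values, percents)) / total
    variance = sum(p * (v - mean) ** 2 for v, p in zip(values, percents)) / total
    return mean, math.sqrt(variance)


stanine_mean, stanine_sd = mean_and_sd(range(1, 10), stanine_percent)
c_mean, c_sd = mean_and_sd(range(0, 11), c_percent)

print(f"stanine: mean {stanine_mean:.2f}, SD {stanine_sd:.2f}")
print(f"C scale: mean {c_mean:.2f}, SD {c_sd:.2f}")
```

Under these assumed percentages the stanine standard deviation comes out just under 2, in line with the value quoted in the text.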

Perhaps the chief practical benefit to be derived from nine units rather than 
11 is that such scores occupy only one column on the IBM punched-card 
records. For research purposes, however, a significant grouping error (see 
Chap. 5) is thus introduced, calling for corrections of various sorts when 
precise statistics are wanted. In guidance work, many counselors would probably 
not like to have the rare one person in a hundred at either extreme 
submerged with the other 3 per cent next to him. There is probably a full unit's 
discrimination between the hundredth person and the next 3 per cent just as 
there is between any other neighboring categories. This loss of discrimination 
in the stanine scale may not be tolerated and is unnecessary in the use 
of profiles in guidance. 


SOME NORM AND PROFILE SUGGESTIONS 


Suggestions were made in Chap. 6 concerning the derivation of centile 
norms and the construction of profiles. Here we are ready for other, more 
comprehensive suggestions. There will be shown a profile chart, in which 
raw scores can be interpreted in terms of the C scale, T scale, or centile rank. 

A Profile Chart with Three Interpretive Scales. Figure 19.6 shows an 
example of a profile chart by means of which raw scores on several tests may 
be readily translated into C-scale, T-scale, or centile equivalents. The seven 
tests are the parts of the Guilford-Zimmerman Aptitude Survey. 

Such a chart is most conveniently prepared by using a plot of the cumula- 
tive distribution on probability paper, as described earlier in this chapter. 
In the chart, the spacing of centile ranks is made to conform to the spacings 
of T and C scales, whose units are at equal intervals. The location of the 
raw scores for each test is made to conform to the appropriate centile levels as 
read from the plot on probability paper. As many of the raw-score integers 
are included as space will permit. 

FIG. 19.6. A profile chart for the seven parts of the Guilford-Zimmerman Aptitude Survey, 
based on norms for college men. The key to the part names is as follows: VC = Verbal 
Comprehension; GR = General Reasoning; NO = Numerical Operations; PS = Perceptual 
Speed; SO = Spatial Orientation; SV = Spatial Visualization; MK = Mechanical 
Knowledge. 


Exercises 


1. a. Determine the standard scores for the two students in Data 19A. 
b. Give a rank order to each student in the five tests, first in terms of raw scores, then 
in terms of standard scores. Explain discrepancies in rank order. 


DATA 19A. MEANS AND STANDARD DEVIATIONS IN FIVE PARTS OF AN ENGINEERING- 
APTITUDE EXAMINATION AND SCORES OF TWO STUDENTS 


 | Syllogism | Paper folding | Form perception 

Mean | 28 | 33 | 26 
Standard deviation | 8 | 5 | 7 
Student A | 30 | 17 | 35 
Student B | 15 | 32 | 41 


2. a. Derive a conversion equation for transforming scores in the syllogism test into a 
scale that would give a mean of 50 and a SD of 10. 
b. Using the equation, determine the scores for students A and B on the new scale. 
3. Determine the equivalent T scores for the upper-category limits of the form-perception 
scores in Data 19B. 


DATA 19B. FREQUENCY DISTRIBUTION OF SCORES FOR ENGINEERING FRESHMEN 
IN THE FORM-PERCEPTION TEST 


Scores | Frequencies 
40-44 | 2 
35-39 | 16 
30-34 | 42 
25-29 | 52 
20-24 | 55 
15-19 | 26 
10-14 | 13 
5-9 | 1 
Σ | 207 


4. By a graphic smoothing process, find a modified set of equivalent T scores for the same 
category limits. 

5. Using the results of Exercise 4, find equivalent T scores for the following raw scores 
in the form-perception test: 8, 12, 16, 22, 37, 42. 

6. Determine for the form-perception test the exact score limits (to one decimal place) 
corresponding to the C-score categories. Use a smoothing process, on regular or probability 
graph paper. 

7. Determine C-score equivalents for the six raw scores listed in Exercise 5. 

8. Through the relationship of either T scores or of C scores to centiles, determine the 
centile equivalents to the raw scores listed in Exercise 5. 

Answers 


1. a. A: +1.50; +1.83; +0.25; -3.20; +1.29. 
      B: 1.75; +2.83; -1.62; -0.20; +2.14. 
2. a. X' = 1.25X + 15. 
   b. X': 52.5; 33.75. 
3. T: 73.3; 63.6; 55.5; 48.9; 41.3; 35.1; 24.2. 
4. T: (19); 72; 64; 56; 49; 41; 34; 26. 
5. T scores: 23; 30; 36; 45; 68; 75. 
6. C-score limits: 39.9; 36.2; 32.8; 29.6; 26.4; 23.2; 19.9; 16.7; 13.3; 9.7. 
7. C scores: 0; 1; 2; 4; 9; 10. 
8. Centiles: 0.5; 2.5; 9.0; 33.5; 97.0; 99.6. 


APPENDIX A 


SOME SELECTED MATHEMATICAL PROOFS AND DERIVATIONS 


A List of Brief Titles 


1. Effect upon a mean of adding a constant 
2. Effect upon a mean of multiplying by a constant 
3. The mean of a simple linear function 
4. Effect upon the standard deviation of adding a constant 
5. Effect upon the standard deviation of multiplying by a constant 
6. The standard deviation of a simple linear function 
7. Variances and standard deviations in combined frequencies 
8. Derivation of the formula for the point-biserial r 
9. Derivation of the phi coefficient from $r_{pbi}$ 
10. Regression coefficients in a two-variable linear equation 
11. The mean of a sum of measures 
12. The variance and standard deviation in a sum of measures 
13. The correlation of sums 
14. Linear transformation equation 

In this Appendix are presented a few of the derivations or proofs of equations. Selection 
has been determined by several considerations: (1) Because of their relative simplicity the 
proofs can be followed by most students; (2) the proofs are illustrative of the manner in 
which formulas in general are derived; (3) the proofs should help to give insight on some 
fundamental statistical concepts; and (4) the proofs are not commonly found elsewhere. 
Footnote references in the preceding chapters often indicate sources of derivations of other 
formulas. 


1. The effect upon a mean of adding a constant to every observed value 


Let X = any observed value in a set of measurements 
C = a constant value added to every X 
$M_x$ = arithmetic mean of all the X values 
$M_{(x+c)}$ = arithmetic mean of all values (X + C) 
N = number of observations in the sample 

Then 

\[
\begin{aligned}
M_{(x+c)} &= \frac{\Sigma (X + C)}{N}\,{}^{*} \\
&= \frac{\Sigma X}{N} + \frac{NC}{N} \\
&= M_x + C
\end{aligned}
\tag{A.1}
\]


* In these equations and those following throughout this Appendix, the summation sign is given 
without showing the range over which summation is made. Strictly speaking, $\Sigma X$ should be written 
here as 

\[ \sum_{1}^{N} X \]

to show that the N values of the sample are included. The omission makes for easier reading, 
particularly where formulas become complicated. It is believed that in all instances the range of 
summation will be clear, if not directly from the formula, at least from the context. 

In other words, the mean of X values, each augmented by the addition of a constant C, 
is equal to the mean of the X's plus the same constant. C may have a negative value as 
well as a positive one. 


2. The effect upon a mean of multiplying each observed value by a constant 


Let $M_{cx}$ = arithmetic mean of all values C × X, and other symbols be defined as in 
1 above. 

\[ M_{cx} = \frac{\Sigma CX}{N} = \frac{C\,\Sigma X}{N} = CM_x \tag{A.2} \]


In other words, the mean of X values all multiplied by the same constant is equal to the 
mean of those values times the constant. 


3. The mean of a linear function of a value 


Let the linear function of X be the regression equation Y' = a + bX (see Chap. 15). 
We want to find the mean $M_{(a+bx)}$. Here we have a combination of a product of a 
constant times X, namely, (bX), and also a constant increment (a). 

\[ M_{(a+bx)} = \frac{\Sigma (a + bX)}{N} = \frac{Na + b\,\Sigma X}{N} = a + bM_x \tag{A.3} \]


In other words, the mean of a linear function of X is that same function of the mean of X. 
This principle is useful in connection with regression equations in general. 


4. Effect upon the standard deviation of adding a constant to each observed value 


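Using the same symbols as above, with the addition of: 
$\sigma_x$ = standard deviation of the X values 
$\sigma_{(x+c)}$ = standard deviation of all values (X + C) 
x = a deviation of X from $M_x$ 
$x_{(x+c)}$ = deviation of (X + C) from the mean ($M_x$ + C) 

We find that 

\[
\begin{aligned}
x_{(x+c)} &= (X + C) - (M_x + C) \\
&= X - M_x \\
&= x
\end{aligned}
\]

From this it follows that 

\[
\begin{aligned}
\Sigma x^2_{(x+c)} &= \Sigma x^2 \\
\sigma^2_{(x+c)} &= \sigma^2_x
\end{aligned}
\]

and 

\[ \sigma_{(x+c)} = \sigma_x \tag{A.4} \]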


In other words, adding a constant to every observed value has no effect upon the standard 
deviation. 

5. Effect upon the standard deviation of multiplying each observed value by a constant, C 

Let $\sigma_{cx}$ = standard deviation of the products CX. From (A.2) above, 

\[ M_{cx} = CM_x \]

Therefore, 

\[
\begin{aligned}
x_{cx} &= CX - CM_x \\
&= C(X - M_x) \\
&= Cx
\end{aligned}
\]

\[ \sigma^2_{cx} = \frac{\Sigma (Cx)^2}{N} = \frac{C^2\,\Sigma x^2}{N} = C^2\sigma^2_x \tag{A.5} \]

Taking square roots of both sides of (A.5), 

\[ \sigma_{cx} = C\sigma_x \tag{A.6} \]


6. Standard deviation of a linear function of X 


If the function of X is a + bX, the mean of this function, from (A.3) above, is equal to 
a + b$M_x$. Each deviation of this function (Y') from its mean is, therefore, 

\[
\begin{aligned}
y_{(a+bx)} &= (a + bX) - (a + bM_x) \\
&= bX - bM_x \\
&= b(X - M_x) \\
&= bx
\end{aligned}
\]

From (A.6), we deduce that $\sigma_{bx} = b\sigma_x$. Therefore, 

\[ \sigma_{(a+bx)} = b\sigma_x \tag{A.7} \]

Thus, wherever we use a simple regression equation of the form Y' = a + bX, the 
standard deviation of Y' equals $b\sigma_x$. 
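
A quick numerical check of (A.3) and (A.7) can be sketched as follows. The data and the constants `a` and `b` are hypothetical, chosen only for illustration; `statistics.pstdev` is used because the text divides by N. Note that the result as stated holds for a positive b; for a negative b the standard deviation is the absolute value of b times the standard deviation of X.

```python
import statistics

x = [12, 15, 9, 20, 14]      # hypothetical observed values
a, b = 3.0, 2.5              # hypothetical constants of the linear function
y_pred = [a + b * v for v in x]

m_x = statistics.fmean(x)
s_x = statistics.pstdev(x)   # N in the denominator, as in the text

# (A.3): the mean of a + bX equals a + b * M_x
mean_matches = abs(statistics.fmean(y_pred) - (a + b * m_x)) < 1e-9

# (A.7): the standard deviation of a + bX equals b * sigma_x (b > 0 here)
sd_matches = abs(statistics.pstdev(y_pred) - b * s_x) < 1e-9

print(mean_matches, sd_matches)
```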


7. Variances and standard deviations of combined distributions 


Assume two sample distributions A and B, whose frequencies are summed to form a 
total distribution T. 
Let $M_a$, $M_b$, and $M_t$ = means of distributions A, B, and T, respectively 
$n_a$, $n_b$, and N = numbers of cases in corresponding distributions 
$X_a$, $X_b$, and $X_t$ = measures in the three distributions, respectively 
$x_a$, $x_b$, and $x_t$ = deviations of measures from the means of their respective 
distributions 
$x_{at}$ and $x_{bt}$ = deviations of measures in distributions A and B, respectively, 
from $M_t$ 
$d_a$ and $d_b$ = deviations of means of distributions A and B, respectively, 
from $M_t$ 

From the preceding, 

\[ d_a = M_a - M_t \quad \text{and} \quad d_b = M_b - M_t \tag{A.8} \]

Transposing, 

\[ M_t = M_a - d_a \quad \text{and} \quad M_t = M_b - d_b \tag{A.9} \]

By definition given above, and from (A.9) and (A.8), 

\[
\begin{aligned}
x_{at} &= X_a - M_t = X_a - M_a + d_a = x_a + d_a \\
x_{bt} &= X_b - M_t = X_b - M_b + d_b = x_b + d_b
\end{aligned}
\]

Squaring both sides of these equations, 

\[
\begin{aligned}
x^2_{at} &= (x_a + d_a)^2 = x^2_a + d^2_a + 2x_a d_a \\
x^2_{bt} &= (x_b + d_b)^2 = x^2_b + d^2_b + 2x_b d_b
\end{aligned}
\]

Summing for all measures in either distribution, 

\[
\begin{aligned}
\Sigma x^2_{at} &= \Sigma x^2_a + n_a d^2_a + 2d_a\,\Sigma x_a \\
\Sigma x^2_{bt} &= \Sigma x^2_b + n_b d^2_b + 2d_b\,\Sigma x_b
\end{aligned}
\]

Now both $\Sigma x_a$ and $\Sigma x_b$ equal zero, which eliminates the last terms from the last two 
equations. The sum of squares in the total distribution is the combination of $\Sigma x^2_{at}$ and 
$\Sigma x^2_{bt}$; in other words, 

\[ \Sigma x^2_t = \Sigma x^2_a + n_a d^2_a + \Sigma x^2_b + n_b d^2_b \tag{A.10a} \]

Or, by combining terms, 

\[ \Sigma x^2_t = (\Sigma x^2_a + \Sigma x^2_b) + (n_a d^2_a + n_b d^2_b) \tag{A.10b} \]


This proof has involved the combination of only two sample distributions. It can readily 
be generalized to include any number of samples, by adding, by analogy, additional equa- 
tions in each step taken above. 
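
The partition in (A.10b) can be verified on a small example. The two samples below are hypothetical, and the helper name `sum_sq` is an illustrative assumption; the sketch simply compares the total sum of squares with the within-sample and between-means parts.

```python
import statistics

sample_a = [4, 7, 9, 10]           # hypothetical distribution A
sample_b = [12, 15, 11, 18, 14]    # hypothetical distribution B
sample_t = sample_a + sample_b     # combined distribution T


def sum_sq(values, about):
    """Sum of squared deviations of values from a given point."""
    return sum((v - about) ** 2 for v in values)


m_a = statistics.fmean(sample_a)
m_b = statistics.fmean(sample_b)
m_t = statistics.fmean(sample_t)
d_a, d_b = m_a - m_t, m_b - m_t    # (A.8)

within = sum_sq(sample_a, m_a) + sum_sq(sample_b, m_b)
between = len(sample_a) * d_a ** 2 + len(sample_b) * d_b ** 2
total = sum_sq(sample_t, m_t)

print(total, within + between)     # the two values agree, per (A.10b)
```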


8. Formula for the point-biserial coefficient of correlation, $r_{pbi}$ 


Let X be a continuous variable, continuously measured. 
Y be a genuine dichotomy, with point values of 0 and +1. 
The cases in the favored category have values of +1. 
N = total number of cases 
$N_p$ = number of cases in the favored category ($N_p = pN$) 
$N_q$ = number of cases in the other category ($N_q = qN$; $N_p + N_q = N$) 
$M_x$ = arithmetic mean of the X values 
$\sigma_x$ = standard deviation of the X values 
$M_p$ = mean of the X values in the favored category on Y 
$M_q$ = mean of the X values for the remaining category 
p = proportion of the cases in the favored category ($p = N_p/N$) 
q = 1 - p; q also equals $N_q/N$ 
$M_y$ = mean of the point values in variable Y. It can be shown to equal p (see 
Table 9.3). 
$\sigma_y$ = standard deviation in the point values. It can be shown to equal $\sqrt{pq}$ (see 
Table 9.3). 


The point-biserial r is a product-moment correlation coefficient. There are several 
ways of deriving the formula for $r_{pbi}$. Let us start with the basic formula for the Pearson r, 

\[ r_{xy} = \frac{\Sigma xy}{N\sigma_x\sigma_y} \tag{A.11} \]

where x = X - $M_x$ and y = Y - $M_y$. 
Therefore, 

\[
\begin{aligned}
\Sigma xy &= \Sigma (X - M_x)(Y - M_y) \\
&= \Sigma XY - M_y\,\Sigma X - M_x\,\Sigma Y + NM_xM_y
\end{aligned}
\tag{A.12}
\]

Substituting $NM_x$ for $\Sigma X$ and $NM_y$ for $\Sigma Y$ in (A.12), 

\[
\begin{aligned}
\Sigma xy &= \Sigma XY - NM_xM_y - NM_xM_y + NM_xM_y \\
&= \Sigma XY - NM_xM_y
\end{aligned}
\tag{A.13}
\]

Substituting (A.13) in (A.11), 

\[ r_{xy} = \frac{\Sigma XY - NM_xM_y}{N\sigma_x\sigma_y} \tag{A.14} \]

Making some other substitutions, 

\[ \Sigma XY = N_pM_p, \qquad NM_xM_y = NM_xp = N_pM_x, \]

and 

\[ r_{pbi} = \frac{N_pM_p - N_pM_x}{N\sigma_x\sqrt{pq}} \tag{A.15} \]

Dividing numerator and denominator of (A.15) by N, 

\[ r_{pbi} = \frac{pM_p - pM_x}{\sigma_x\sqrt{pq}} = \frac{(M_p - M_x)\,p}{\sigma_x\sqrt{pq}} \tag{A.16} \]

Dividing numerator and denominator of (A.16) by $\sqrt{p}$, we get 

\[ r_{pbi} = \frac{M_p - M_x}{\sigma_x}\sqrt{\frac{p}{q}} \tag{A.17} \]

This is one form of the equation for the point-biserial r. If we want the form involving 
$M_q$ rather than $M_x$, some further proof is required. 

\[ M_x = pM_p + qM_q \]

so that 

\[
\begin{aligned}
M_p - M_x &= M_p - pM_p - qM_q \\
&= (1 - p)M_p - qM_q \\
&= qM_p - qM_q \\
&= q(M_p - M_q)
\end{aligned}
\tag{A.18}
\]

Substituting (A.18) in (A.17), 

\[ r_{pbi} = \frac{M_p - M_q}{\sigma_x}\sqrt{pq} \tag{A.19} \]


9. Derivation of the formula for phi from $r_{pbi}$ 


Phi is a product-moment correlation in a 2 X 2 contingency table where both variables 
are genuine dichotomies and the distributions are point distributions, with values of +1 
and 0. Let the symbols used be defined in the two following tables, one based upon fre- 


quencies and the other upon corresponding proportions. 


FREQUENCIES PROPORTIONS 


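Substituting these values in (A.19), we have 

\[ \phi = \left(\frac{\alpha}{p} - \frac{\gamma}{q}\right)\frac{\sqrt{pq}}{\sqrt{p'q'}} \tag{A.20} \]

Now 

\[ \frac{\alpha}{p} - \frac{\gamma}{q} = \frac{\alpha q - \gamma p}{pq} \tag{A.21} \]

And since p = α + β and q = γ + δ, the right side of (A.21) becomes 

\[ \frac{\alpha(\gamma + \delta) - \gamma(\alpha + \beta)}{pq} = \frac{\alpha\gamma + \alpha\delta - \alpha\gamma - \beta\gamma}{pq} = \frac{\alpha\delta - \beta\gamma}{pq} \tag{A.22} \]

Substituting (A.22) in (A.20), 

\[ \phi = \frac{(\alpha\delta - \beta\gamma)\sqrt{pq}}{pq\sqrt{p'q'}} = \frac{\alpha\delta - \beta\gamma}{\sqrt{pqp'q'}} \tag{A.23} \]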
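
That phi is an ordinary product-moment correlation for 0/1 point values can be illustrated numerically. The sketch below assumes a 2 × 2 table of proportions laid out so that α is the proportion of cases favored on both variables, with p = α + β, q = γ + δ, p' = α + γ, and q' = β + δ; this layout and the cell values are assumptions made for illustration.

```python
import math

# Hypothetical 2 x 2 proportions; alpha + beta + gamma + delta = 1.
alpha, beta, gamma, delta = 0.40, 0.10, 0.20, 0.30

p, q = alpha + beta, gamma + delta              # marginals of the first variable
p_prime, q_prime = alpha + gamma, beta + delta  # marginals of the second variable

# Phi from the cross-product formula.
phi = (alpha * delta - beta * gamma) / math.sqrt(p * q * p_prime * q_prime)

# Pearson r computed directly for point values 0 and 1:
# the mean product is alpha, the means are p and p', the SDs sqrt(pq) and sqrt(p'q').
covariance = alpha - p * p_prime
pearson = covariance / (math.sqrt(p * q) * math.sqrt(p_prime * q_prime))

print(round(phi, 4), round(pearson, 4))
```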
10. Regression coefficients in a two-variable linear equation 

Let the general regression equation for a straight line be 

\[ Y' = a + bX \]

Problem: To find for any set of data involving corresponding X and Y those values of 
a and b which will make $\Sigma (Y - Y')^2$ a minimum. 
We first set up an equation involving the expression (Y - Y'): 

\[ (Y - Y') = Y - a - bX \]

Squaring both sides, we have an expression for the discrepancy squared: 

\[
\begin{aligned}
(Y - Y')^2 &= (Y - a - bX)^2 \\
&= Y^2 + a^2 + b^2X^2 - 2aY - 2bXY + 2abX
\end{aligned}
\]

Summing for all observations, 

\[ \Sigma (Y - Y')^2 = \Sigma Y^2 + Na^2 + b^2\,\Sigma X^2 - 2a\,\Sigma Y - 2b\,\Sigma XY + 2ab\,\Sigma X \tag{A.24} \]

The partial derivatives of (A.24) are 

\[ \frac{\partial\,\Sigma (Y - Y')^2}{\partial a} = 2Na - 2\,\Sigma Y + 2b\,\Sigma X \tag{A.25} \]

\[ \frac{\partial\,\Sigma (Y - Y')^2}{\partial b} = 2b\,\Sigma X^2 - 2\,\Sigma XY + 2a\,\Sigma X \tag{A.26} \]

Setting derivative (A.25) equal to zero, we have 

\[ 2Na - 2\,\Sigma Y + 2b\,\Sigma X = 0 \quad \text{or} \quad Na - \Sigma Y + b\,\Sigma X = 0 \]

Transposing, we have 

\[ Na + b\,\Sigma X = \Sigma Y \tag{A.27} \]

Setting derivative (A.26) equal to zero, we have 

\[ 2b\,\Sigma X^2 - 2\,\Sigma XY + 2a\,\Sigma X = 0 \quad \text{or} \quad b\,\Sigma X^2 - \Sigma XY + a\,\Sigma X = 0 \]

Transposing, we have 

\[ a\,\Sigma X + b\,\Sigma X^2 = \Sigma XY \tag{A.28} \]

(A.27) and (A.28) provide us with two normal equations which, solved simultaneously, 
give us formulas for deriving a and b from the observations X and Y. Dividing (A.27) 
by N, we have 

\[ \frac{Na}{N} + \frac{b\,\Sigma X}{N} = \frac{\Sigma Y}{N} \]

\[ a + bM_x = M_y \]

Transposing, 

\[ a = M_y - bM_x \tag{A.29} \]

Substituting (A.29) in (A.28), we have 

\[ (\Sigma X)M_y - (\Sigma X)M_x b + b\,\Sigma X^2 = \Sigma XY \]

Collecting terms and transposing, 

\[ b\,[\Sigma X^2 - (\Sigma X)M_x] = \Sigma XY - (\Sigma X)M_y \]

Solving for b, 

\[ b = \frac{\Sigma XY - (\Sigma X)M_y}{\Sigma X^2 - (\Sigma X)M_x} \tag{A.30} \]

Multiplying numerator and denominator by N, 

\[ b = \frac{N\,\Sigma XY - (\Sigma X)(\Sigma Y)}{N\,\Sigma X^2 - (\Sigma X)^2} \tag{A.31} \]
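
Formulas (A.31) and (A.29) translate directly into a short computation. The following is a minimal sketch; the function name `fit_line` and the data are hypothetical, chosen so that the points lie exactly on Y = 1 + 2X and the recovered coefficients are easy to check.

```python
def fit_line(xs, ys):
    """Least-squares coefficients a and b for Y' = a + bX from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # (A.31)
    a = sum_y / n - b * (sum_x / n)                               # (A.29)
    return a, b


xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]     # hypothetical data lying on Y = 1 + 2X
a, b = fit_line(xs, ys)
print(a, b)
```

The denominator of (A.31) is zero when all X values are equal, in which case no slope is defined; the sketch does not guard against that case.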


11. The mean of a sum of measurements 


a. For equally weighted measurements: 

Let $X_1$ and $X_2$ be two independently derived measures of the same individual. Let 
$X_1$ and $X_2$ be summed for each individual, giving a composite measure $X_1 + X_2$. The 
problem is to find the mean of the composite, $M_{(x_1+x_2)}$. 

\[
\begin{aligned}
M_{(x_1+x_2)} &= \frac{\Sigma (X_1 + X_2)}{N} \\
&= \frac{\Sigma X_1 + \Sigma X_2}{N} \\
&= M_1 + M_2
\end{aligned}
\tag{A.32}
\]

where $M_1$ = mean of $X_1$ values and $M_2$ = mean of $X_2$ values. 
For the general case, in which there are n measurements of each individual, it can be 
similarly shown that 

\[ M_{(x_1+x_2+\cdots+x_n)} = M_1 + M_2 + \cdots + M_n \tag{A.33} \]

If we let the symbols $M_s$ = mean of an unweighted sum of n measures and $M_i$ = the 
mean of any one of the measures $X_1$ to $X_n$ inclusive, we may write equation (A.33) in 
more economical form as 

\[ M_s = \Sigma M_i \tag{A.34} \]

In other words, when measures are summed without weighting, the mean of the sums is 
equal to the sum of the means. 

b. For differentially weighted measurements: 
When the measurements $X_1$ and $X_2$ are weighted by multipliers $w_1$ and $w_2$, respectively, 

\[
\begin{aligned}
M_{(w_1x_1+w_2x_2)} &= \frac{\Sigma (w_1X_1 + w_2X_2)}{N} \\
&= \frac{w_1\,\Sigma X_1 + w_2\,\Sigma X_2}{N} \\
&= \frac{w_1\,\Sigma X_1}{N} + \frac{w_2\,\Sigma X_2}{N} \\
&= w_1M_1 + w_2M_2
\end{aligned}
\]

To describe the general case, with n measurements, 

\[ M_{(w_1x_1+w_2x_2+\cdots+w_nx_n)} = w_1M_1 + w_2M_2 + \cdots + w_nM_n \tag{A.35} \]

If $M_{ws}$ symbolizes the mean of a weighted composite, and $M_i$ symbolizes the mean of 
any one measurement that enters into it, we may write equation (A.35) in abbreviated 
form: 

\[ M_{ws} = \Sigma w_iM_i \tag{A.36} \]

12. Variance and standard deviation of a sum 


a. When measurements are equally weighted: 

Let $X_1$ and $X_2$ be two independently derived measures of the same individual, summed 
without weighting to obtain a composite measure. The variance of the composite measures 
is given by the equation 

\[ \sigma^2_{(x_1+x_2)} = \frac{\Sigma (x_1 + x_2)^2}{N} \tag{A.37} \]

where $(x_1 + x_2)$ = a deviation of $(X_1 + X_2)$ from $M_{(x_1+x_2)}$.* Expanding the binomial 
in (A.37), 

\[
\begin{aligned}
\sigma^2_{(x_1+x_2)} &= \frac{\Sigma (x_1^2 + x_2^2 + 2x_1x_2)}{N} \\
&= \frac{\Sigma x_1^2}{N} + \frac{\Sigma x_2^2}{N} + 2\,\frac{\Sigma x_1x_2}{N}
\end{aligned}
\tag{A.38}
\]

The most meaningful interpretation to make of (A.38) in this development is to say 
that the first term on the right of the equality sign is the variance in $X_1$, the second term 
is the variance in $X_2$, and the third term is twice the covariance between $X_1$ and $X_2$. It 
will be helpful, next, to relate the covariance term to the correlation between $X_1$ and $X_2$. 
By the Pearson product-moment formula, 

\[ r_{12} = \frac{\Sigma x_1x_2}{N\sigma_1\sigma_2} \tag{A.39} \]

Multiplying both sides of (A.39) by $\sigma_1\sigma_2$, 

\[ r_{12}\sigma_1\sigma_2 = \frac{\Sigma x_1x_2}{N} \tag{A.40} \]

Substituting $\sigma_1^2$, $\sigma_2^2$, and $r_{12}\sigma_1\sigma_2$ in (A.38), we have 

\[ \sigma^2_{(x_1+x_2)} = \sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2 \tag{A.41} \]

* The deviation of a composite of two values from the mean of the composite equals $x_1 + x_2$, for 

\[ (X_1 + X_2) - (M_1 + M_2) = (X_1 - M_1) + (X_2 - M_2) = x_1 + x_2 \]

Taking square roots of both sides of (A.41), 

\[ \sigma_{(x_1+x_2)} = \sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2} \tag{A.42} \]

In other words, the variance of an unweighted sum of two measures is equal to the sum 
of the variances of the components plus two times their covariance. To generalize to any 
number of unweighted components, and remembering that we shall have as many 
covariance terms as there are pairs of components, 

\[ \sigma^2_{(x_1+x_2+\cdots+x_n)} = \sigma_1^2 + \sigma_2^2 + \cdots + \sigma_n^2 + 2r_{12}\sigma_1\sigma_2 + 2r_{13}\sigma_1\sigma_3 + \cdots + 2r_{(n-1)n}\sigma_{n-1}\sigma_n \]

Let $\sigma_s^2$ = variance of an unweighted sum of any number of measures and 
$\sigma_i^2$ = variance of any measure from 1 to n, inclusive. 

Then 

\[ \sigma_s^2 = \Sigma\sigma_i^2 + 2\,\Sigma r_{ij}\sigma_i\sigma_j \quad (\text{where } i < j) \tag{A.43} \]

By square roots, the standard deviation of a sum is given by 

\[ \sigma_s = \sqrt{\Sigma\sigma_i^2 + 2\,\Sigma r_{ij}\sigma_i\sigma_j} \quad (\text{where } i < j) \tag{A.44} \]

b. When measurements are differentially weighted: 
Let the weights to be applied to $X_1, X_2, \ldots, X_n$ be $w_1, w_2, \ldots, w_n$, respectively. 
For the variance of the sum of two weighted measurements: 

\[
\begin{aligned}
\sigma^2_{(w_1x_1+w_2x_2)} &= \frac{\Sigma (w_1x_1 + w_2x_2)^2}{N} \\
&= \frac{\Sigma (w_1^2x_1^2 + w_2^2x_2^2 + 2w_1w_2x_1x_2)}{N} \\
&= w_1^2\,\frac{\Sigma x_1^2}{N} + w_2^2\,\frac{\Sigma x_2^2}{N} + 2w_1w_2\,\frac{\Sigma x_1x_2}{N}
\end{aligned}
\]

Making substitutions similar to those made in (A.38), 

\[ \sigma^2_{(w_1x_1+w_2x_2)} = w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2 \tag{A.45} \]

In other words, the variance of a weighted sum of two measures equals the sum of the 
component variances, each weighted by its weight squared, plus twice the covariance 
multiplied by the product of the weights. The standard deviation, by taking square roots, is 

\[ \sigma_{(w_1x_1+w_2x_2)} = \sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2} \tag{A.46} \]

Generalized to include n components and to apply the symbols as defined in (A.43), 

\[ \sigma_{ws} = \sqrt{\Sigma w_i^2\sigma_i^2 + 2\,\Sigma r_{ij}w_iw_j\sigma_i\sigma_j} \quad (\text{where } i < j) \tag{A.47} \]
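
Equation (A.45) can be checked against a direct computation of the variance of a weighted composite. The data, the weights, and the helper name `pearson_r` below are hypothetical; population formulas (N in the denominator) are used throughout, as in the text.

```python
import statistics


def pearson_r(u, v):
    """Product-moment correlation with N in the denominator."""
    m_u, m_v = statistics.fmean(u), statistics.fmean(v)
    covariance = sum((a - m_u) * (b - m_v) for a, b in zip(u, v)) / len(u)
    return covariance / (statistics.pstdev(u) * statistics.pstdev(v))


x1 = [2, 4, 6, 8, 10]     # hypothetical measures on five individuals
x2 = [1, 3, 2, 5, 4]
w1, w2 = 2.0, 0.5         # hypothetical weights

s1, s2 = statistics.pstdev(x1), statistics.pstdev(x2)
r12 = pearson_r(x1, x2)

# Variance of the weighted composite, computed directly from the sums.
composite = [w1 * a + w2 * b for a, b in zip(x1, x2)]
var_direct = statistics.pvariance(composite)

# The same variance from the component statistics, per (A.45).
var_formula = w1**2 * s1**2 + w2**2 * s2**2 + 2 * r12 * w1 * w2 * s1 * s2

print(round(var_direct, 6), round(var_formula, 6))
```

Setting both weights to 1 reduces the computation to the unweighted case of (A.41).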


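13. Correlation of sums 

a. Correlation between one variable, C, and an unweighted sum of two other variables, 
$X_1$ and $X_2$: 
Applying the Pearson product-moment formula to this problem, 

\[
\begin{aligned}
r_{c(x_1+x_2)} &= \frac{\Sigma c(x_1 + x_2)}{N\sigma_c\sigma_{(x_1+x_2)}} \\
&= \frac{\Sigma cx_1 + \Sigma cx_2}{N\sigma_c\sigma_{(x_1+x_2)}}
\end{aligned}
\tag{A.48}
\]

Now $\Sigma cx_1 = Nr_{c1}\sigma_c\sigma_1$ and $\Sigma cx_2 = Nr_{c2}\sigma_c\sigma_2$. 
Substituting these values in (A.48), we have 

\[ r_{c(x_1+x_2)} = \frac{Nr_{c1}\sigma_c\sigma_1 + Nr_{c2}\sigma_c\sigma_2}{N\sigma_c\sigma_{(x_1+x_2)}} \]

Eliminating $N\sigma_c$ and expanding the standard deviation of the sum, 

\[ r_{c(x_1+x_2)} = \frac{r_{c1}\sigma_1 + r_{c2}\sigma_2}{\sqrt{\sigma_1^2 + \sigma_2^2 + 2r_{12}\sigma_1\sigma_2}} \tag{A.49} \]

Let $r_{cs}$ = correlation of the sum of n unweighted measures with C 
$X_i$ = any variable from 1 to n, inclusive 
$r_{ci}$ = correlation of C with any variable 1 to n 
$X_j$ = any variable with a greater subscript number than $X_i$ 
Extended to the general case, (A.49) becomes 

\[ r_{cs} = \frac{\Sigma r_{ci}\sigma_i}{\sqrt{\Sigma\sigma_i^2 + 2\,\Sigma r_{ij}\sigma_i\sigma_j}} \quad (\text{where } i < j) \tag{A.50} \]

b. Correlation of one variable, C, with the sum of differentially weighted variables: 
Let $w_1, w_2, \ldots, w_n$ weights be applied to measures $X_1, X_2, \ldots, X_n$, respectively. 
For the sum of two variables, by Pearson's formula, 

\[
\begin{aligned}
r_{c(w_1x_1+w_2x_2)} &= \frac{\Sigma c(w_1x_1 + w_2x_2)}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}} \\
&= \frac{w_1\,\Sigma cx_1 + w_2\,\Sigma cx_2}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}}
\end{aligned}
\]

Making substitutions as in (A.48) above, 

\[ r_{c(w_1x_1+w_2x_2)} = \frac{Nw_1r_{c1}\sigma_c\sigma_1 + Nw_2r_{c2}\sigma_c\sigma_2}{N\sigma_c\sigma_{(w_1x_1+w_2x_2)}} \]

Eliminating $N\sigma_c$ and expanding the standard deviation of the weighted sum, 

\[ r_{c(w_1x_1+w_2x_2)} = \frac{w_1r_{c1}\sigma_1 + w_2r_{c2}\sigma_2}{\sqrt{w_1^2\sigma_1^2 + w_2^2\sigma_2^2 + 2r_{12}w_1w_2\sigma_1\sigma_2}} \tag{A.51} \]

Generalizing to any number of weighted components, 

\[ r_{cws} = \frac{\Sigma w_ir_{ci}\sigma_i}{\sqrt{\Sigma w_i^2\sigma_i^2 + 2\,\Sigma r_{ij}w_iw_j\sigma_i\sigma_j}} \quad (\text{where } i < j) \tag{A.52} \]

c. Correlation of two unweighted composites: 

Without presenting the proof, which is quite analogous to those just presented, two 
formulas will be given here for the correlation of two composite measures from information 
about correlations among the components. 

Let $X_i$ and $X_j$ be any two measures in the first composite, $C_1$, and $X_u$ and $X_v$ be any two 
measures in the second composite, $C_2$. By analogy to (A.50) and (A.52), the following 
equations apply. (A.53) is for two unweighted composites, and (A.54) for weighted 
composites. (A.54) reduces to (A.53) if all weights are +1. 

\[ r_{c_1c_2} = \frac{\Sigma r_{iu}\sigma_i\sigma_u}{\sqrt{\Sigma\sigma_i^2 + 2\,\Sigma r_{ij}\sigma_i\sigma_j}\,\sqrt{\Sigma\sigma_u^2 + 2\,\Sigma r_{uv}\sigma_u\sigma_v}} \quad (\text{where } i < j \text{ and } u < v) \tag{A.53} \]

\[ r_{wc_1wc_2} = \frac{\Sigma w_iw_ur_{iu}\sigma_i\sigma_u}{\sqrt{\Sigma w_i^2\sigma_i^2 + 2\,\Sigma r_{ij}w_iw_j\sigma_i\sigma_j}\,\sqrt{\Sigma w_u^2\sigma_u^2 + 2\,\Sigma r_{uv}w_uw_v\sigma_u\sigma_v}} \quad (\text{where } i < j \text{ and } u < v) \tag{A.54} \]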
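
As an illustration of (A.49), the correlation of a variable with an unweighted sum can be computed both directly and from the component correlations and standard deviations. The three score lists and the helper name `pearson_r` are hypothetical.

```python
import math
import statistics


def pearson_r(u, v):
    """Product-moment correlation with N in the denominator."""
    m_u, m_v = statistics.fmean(u), statistics.fmean(v)
    covariance = sum((a - m_u) * (b - m_v) for a, b in zip(u, v)) / len(u)
    return covariance / (statistics.pstdev(u) * statistics.pstdev(v))


c = [3, 1, 4, 1, 5, 9]     # hypothetical criterion variable C
x1 = [2, 7, 1, 8, 2, 8]    # hypothetical component X1
x2 = [1, 4, 1, 4, 2, 1]    # hypothetical component X2

s1, s2 = statistics.pstdev(x1), statistics.pstdev(x2)
r_c1, r_c2, r_12 = pearson_r(c, x1), pearson_r(c, x2), pearson_r(x1, x2)

# Direct route: form the sum for each individual, then correlate with C.
summed = [a + b for a, b in zip(x1, x2)]
r_direct = pearson_r(c, summed)

# Route through (A.49), using only component statistics.
r_formula = (r_c1 * s1 + r_c2 * s2) / math.sqrt(s1**2 + s2**2 + 2 * r_12 * s1 * s2)

print(round(r_direct, 6), round(r_formula, 6))
```

The second route is the useful one in practice when only the intercorrelations and standard deviations of the components are available.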


14. Linear transformation of values in one distribution to corresponding standard-score 
positions in another 
Problem: Given a distribution of observed values, to find a linear equation which will 
determine for each value one that deviates as much in terms of standard-deviation units 
from the mean in another distribution of similar values and in the same direction. 
Let $X_a$ = a value in distribution A 
$M_a$ = mean of values in distribution A 
$\sigma_a$ = standard deviation in distribution A 
$X_b$ = a value in distribution B 
$M_b$ = mean of values in distribution B 
$\sigma_b$ = standard deviation in distribution B 
$X_{ba}$ = a value in distribution A equivalent to one in distribution B, where 
equivalence is as defined above 
Assume, as the problem statement requires, that standard measures or deviates in the 
two distributions are equal. In equation form, 

\[ \frac{X_{ba} - M_a}{\sigma_a} = \frac{X_b - M_b}{\sigma_b} \tag{A.55} \]

Multiplying (A.55) by $\sigma_a$, 

\[
\begin{aligned}
X_{ba} - M_a &= \frac{X_b\sigma_a - M_b\sigma_a}{\sigma_b} \\
&= \left(\frac{\sigma_a}{\sigma_b}\right)X_b - \left(\frac{\sigma_a}{\sigma_b}\right)M_b
\end{aligned}
\]

Transposing, 

\[
\begin{aligned}
X_{ba} &= \left(\frac{\sigma_a}{\sigma_b}\right)X_b - \left(\frac{\sigma_a}{\sigma_b}\right)M_b + M_a \\
&= \left(\frac{\sigma_a}{\sigma_b}\right)X_b + \left[M_a - \left(\frac{\sigma_a}{\sigma_b}\right)M_b\right]
\end{aligned}
\tag{A.56}
\]
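
Equation (A.56) amounts to a slope and an intercept, which the following sketch computes. The function name `to_scale` is an illustrative assumption. The example values are those of the syllogism test in the Chapter 19 exercises (mean 28, standard deviation 8), converted to a scale with mean 50 and standard deviation 10.

```python
def to_scale(x_b, mean_b, sd_b, mean_a, sd_a):
    """Convert a value from distribution B to its equivalent in distribution A.

    Implements X_ba = (sd_a / sd_b) * X_b + [mean_a - (sd_a / sd_b) * mean_b].
    """
    slope = sd_a / sd_b
    intercept = mean_a - slope * mean_b
    return slope * x_b + intercept


# Syllogism test (mean 28, SD 8) to a scale with mean 50 and SD 10.
score_a = to_scale(30, mean_b=28, sd_b=8, mean_a=50, sd_a=10)
score_b = to_scale(15, mean_b=28, sd_b=8, mean_a=50, sd_a=10)
print(score_a, score_b)
```

Here the slope is 1.25 and the intercept 15, which agrees with the conversion equation given in the answers to the exercises.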
TABLES 
A List of Brief Titles 


A. Squares and square roots of numbers 1 to 1,000 
B. Proportions of area under the normal distribution curve 
C. Standard scores and ordinates corresponding to areas under the normal curve 
D. Significant coefficients of correlation and t ratios 
E. Chi square 
F. F ratio 
G. Functions of p, q, z, and y 
H. Fisher's z for different values of r 
J. Trigonometric functions 
K. Four-place logarithms of numbers 
L. Significance of rank-difference correlations 
M. Values for estimation of the cosine-pi coefficient of correlation 
N. Significant chi squares in small samples 
O. Binomial distributions 
P. Significant T values for ranked differences 
Q. Significant R values for sums of ranks 

TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* 

* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 


Number Square Square root 

41 16 81 6.4031 

Number Square root 
121 11.0000 
122 11.0454 
123 11.0905 
124 11.1355 
125 11.1803 
126 11.2250 
127 11.2694 
128 11.3137 
129 11.3578 
130 11.4018 
131 11.4455 
132 11.4891 
133 11.5326 
134 11.5758 
135 11.6190 
136 11.6619 
137 11.7047 
138 11.7473 
139 11.7898 
140 11.8322 
141 11.8743 
142 11.9164 
143 11.9583 
144 12.0000 
145 12.0416 
146 12.0830 
147 12.1244 
148 12.1655 
149 12.2066 
150 12.2474 
151 12.2882 
152 12.3288 
153 12.3693 
154 12.4097 
155 12.4499 
156 12.4900 
157 12.5300 
158 12.5698 
159 12.6095 
160 12.6491 

Number Square root | Number Square root 
161 12.6886 201 14.1774 
162 12.7279 202 14.2127 
163 12.7671 203 14.2478 
164 12.8062 204 14.2829 
165 12.8452 205 14.3178 
166 12.8841 206 14.3527 
167 12.9228 207 14.3875 
168 12.9615 208 14.4222 
169 13.0000 209 14.4568 
170 13.0384 210 14.4914 
171 13.0767 211 14.5258 
172 13.1149 212 14.5602 
173 13.1529 213 14.5945 
174 13.1909 214 14.6287 
175 13.2288 215 14.6629 
176 13.2665 216 14.6969 
177 13.3041 217 14.7309 
178 13.3417 218 14.7648 
179 13.3791 219 14.7986 
180 13.4164 220 14.8324 
181 13.4536 221 14.8661 
182 13.4907 222 14.8997 
183 13.5277 223 14.9332 
184 13.5647 224 14.9666 
185 13.6015 225 15.0000 
186 13.6382 226 15.0333 
187 13.6748 227 15.0665 
188 13.7113 228 15.0997 
189 13.7477 229 15.1327 
190 13.7840 230 15.1658 
191 13.8203 231 15.1987 
192 13.8564 232 15.2315 
193 13.8924 233 15.2643 
194 13.9284 234 15.2971 
195 13.9642 235 15.3297 
196 14.0000 236 15.3623 
197 14.0357 237 15.3948 
198 14.0712 238 15.4272 
199 14.1067 239 15.4596 
200 14.1421 240 15.4919 

Number Square Square root | Number Square root 
241 5 80 81 15.5242 281 16.7631 
242 5 85 64 15.5563 282 16.7929 
243 5 90 49 15.5885 283 16.8226 
244 5 95 36 15.6205 284 16.8523 
245 6 00 25 15.6525 285 16.8819 
246 6 05 16 15.6844 286 16.9115 
247 6 10 09 15.7162 287 16.9411 
248 6 15 04 15.7480 288 16.9706 
249 6 20 01 15.7797 289 17.0000 
250 6 25 00 15.8114 290 17.0294 
251 6 30 01 15.8430 291 17.0587 
252 6 35 04 15.8745 292 17.0880 
253 6 40 09 15.9060 293 17.1172 
254 6 45 16 15.9374 294 17.1464 
255 6 50 25 15.9687 295 17.1756 
256 6 55 36 16.0000 296 17.2047 
257 6 60 49 16.0312 297 17.2337 
258 6 65 64 16.0624 298 17.2627 
259 6 70 81 16.0935 299 17.2916 
260 6 76 00 16.1245 300 17.3205 
261 6 81 21 16.1555 301 17.3494 
262 6 86 44 16.1864 302 17.3781 
263 6 91 69 16.2173 303 17.4069 
264 6 96 96 16.2481 304 17.4356 
265 7 02 25 16.2788 305 17.4642 
266 7 07 56 16.3095 306 17.4929 
267 7 12 89 16.3401 307 17.5214 
268 7 18 24 16.3707 308 17.5499 
269 7 23 61 16.4012 309 17.5784 
270 7 29 00 16.4317 310 17.6068 
271 7 34 41 16.4621 311 17.6352 
272 7 39 84 16.4924 312 17.6635 
273 7 45 29 16.5227 313 17.6918 
274 7 50 76 16.5529 314 17.7200 
275 7 56 25 16.5831 315 17.7482 
276 7 61 76 16.6132 316 17.7764 
277 7 67 29 16.6433 317 17.8045 
278 7 72 84 16.6733 318 17.8326 
279 7 78 41 16.7033 319 17.8606 
280 7 84 00 16.7332 320 17.8885 

Number Square Square root | Number Square Square root 
321 1030 41 17.9165 361 13 03 21 19.0000 
322 10 36 84 17.9444 362 13 10 44 19.0263 
323 10 43 29 17.9722 363 13 17 69 19.0526 
324 10 49 76 18.0000 364 13 24 96 19.0788 
325 10 56 25 18.0278 365 13 3225 19.1050 
326 1062 76 18.0555 366 13 39 56 19.1311 
327 10 69 29 18.0831 367 13 46 89 19.1572 
328 1075 84 18.1108 368 13 5424 19.1833 
329 10 82 41 18.1384 369 13 61 61 19.2094 
330 10 89 00 18.1659 370 13 69 00 19.2354 
331 1095 61 18.1934 371 137641 19.2614 
332 1102 24 18.2209 372 13 83 84 19.2873 
333 1108 89 18.2483 373 13 9129 19.3132 
334 11 15 56 18.2757 374 13 98 76 19.3391 
335 11 22 25 18.3030 375 14 06 25 19.3649 
336 1128 96 18.3303 376 14 13 76 19.3907 
337 1135 69 18.3576 377 142129 19.4165 
338 114244 18.3848 378 1428 84 19.4422 
339 114921 18.4120 379 1436 41 19.4679 
340 11 5600 18.4391 380 14 44 00 19.4936 
341 116281 18.4662 381 14 5161 19.5192 
342 11 69 64 18.4932 382 14 5924 19.5448 
343 117649 18.5203 383 14 66 89 19.5704 
344 11 83 36 18.5472 384 14 74 56 19.5959 
345 119025 18.5742 385 148225 19.6214 
346 1197 16 18.6011 386 14 89 96 19.6469 
347 12 04 09 18.6279 387 14 97 69 19.6723 
348 121104 18.6548 388 15 05 44 19.6977 
349 121801 18.6815 389 151321 19.7231 
350 122500 18.7083 390 152100 19.7484 
351 123201 18.7350 391 1528 81 19.7737 
352 12 39 04 18.7617 392 15 36 64 19.7990 
353 12 46 09 18.7883 393 1544 49 19.8242 
354 1253 16 18.8149 394 15 52 36 19.8494 
355 12 60 25 18.8414 395 15 60 25 19.8746 
356 12 67 36 18.8680 396 15 68 16 19.8997 
357 12 74 49 18.8944 397 15 76 09 19.9249 
358 128164 18.9209 398 158404 19.9499 
359 12 88 81 18.9473 399 159201 19.9750 
360 12 96 00 18.9737 400 16 00 00 20.0000 


401 1608 01 20.0250 441 19 44 81 21.0000 
402 161604 20.0499 442 195364 21.0238 
403 16 24 09 20.0749 443 19 62 49 21.0476 
404 16 32 16 20.0998 444 1971 36 21.0713 
405 164025 20.1246 445 19 80 25 21.0950 
406 16 48 36 20.1494 446 19 89 16 21.1187 
407 16 56 49 20.1742 447 19 98 09 21.1424 
408 16 64 64 20.1990 448 20 07 04 21.1660 
409 167281 20.2237 449 20 1601 21.1896 
410 168100 20.2485 450 20 25 00 21.2132 
411 16 89 21 20.2731 451 20 34 01 21.2368 
412 16 97 44 20.2978 452 20 43 04 21.2603 
413 17 05 69 20.3224 453 20 5209 21.2838 
414 17 13 96 20.3470 454 20 61 16 21.3073 
415 17 22 25 20.3715 455 20 70 25 21.3307 
416 17 30 56 20.3961 456 20 79 36 21.3542 
417 17 38 89 20.4206 457 20 88 49 21.3776 
418 1747 24 20.4450 458 2097 64 21.4009 
419 17 55 61 20.4695 459 21 06 81 21.4243 
420 176400 20.4939 460 21 1600 21.4476 
421 177241 20.5183 461 212521 21.4709 
422 17 80 84 20.5426 462 213444 21.4942 
423 17 89 29 20.5670 463 21 43 69 21.5174 
424 1797 76 20.5913 464 21 52 96 21.5407 
425 18 06 25 20.6155 465 216225 21.5639 
426 1814 76 20.6398 466 217156 21.5870 
427 1823 29 20.6640 467 21 80 89 21.6102 
428 1831 84 20.6882 468 21 90 24 21.6333 
429 18 40 41 20.7123 469 219961 21.6564 
430 18 49 00 20.7364 470 220900 21.6795 
431 18 57 61 20.7605 471 22 18 41 21.7025 
432 18 66 24 20.7846 472 22 27 84 21.7256 
433 18 74 89 20.8087 473 22 3729 21.7486 
434 18 83 56 20.8327 474 22 46 76 21.7715 
435 18 92 25 20.8567 475 22 5625 21.7945 
436 19 00 96 20.8806 476 22 65 76 21.8174 
437 1909 69 20.9045 477 22 7529 21.8403 
438 191844 20.9284 478 228484 21.8632 
439 192721 20.9523 479 229441 21.8861 
440 1936 00 20.9762 480 23 0400 21.9089 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
481 23 13 61 21.9317 521 27 14 41 22.8254 
482 23 23 24 21.9545 522 27 24 84 22.8473 
483 23 32 89 21.9773 523 27 35 29 22.8692 
484 23 42 56 22.0000 524 27 45 76 22.8910 
485 23 52 25 22.0227 525 27 56 25 22.9129 
486 23 61 96 22.0454 526 27 66 76 22.9347 
487 23 71 69 22.0681 527 27 77 29 22.9565 
488 23 81 44 22.0907 528 27 87 84 22.9783 
489 23 91 21 22.1133 529 27 98 41 23.0000 
490 24 01 00 22.1359 530 28 09 00 23.0217 
491 24 10 81 22.1585 531 28 19 61 23.0434 
492 24 20 64 22.1811 532 28 30 24 23.0651 
493 24 30 49 22.2036 533 28 40 89 23.0868 
494 24 40 36 22.2261 534 28 51 56 23.1084 
495 24 50 25 22.2486 535 28 62 25 23.1301 
496 24 60 16 22.2711 536 28 72 96 23.1517 
497 24 70 09 22.2935 537 28 83 69 23.1733 
498 24 80 04 22.3159 538 28 94 44 23.1948 
499 24 90 01 22.3383 539 29 05 21 23.2164 
500 25 00 00 22.3607 540 29 16 00 23.2379 
501 25 10 01 22.3830 541 29 26 81 23.2594 
502 25 20 04 22.4054 542 29 37 64 23.2809 
503 25 30 09 22.4277 543 29 48 49 23.3024 
504 25 40 16 22.4499 544 29 59 36 23.3238 
505 25 50 25 22.4722 545 29 70 25 23.3452 
506 25 60 36 22.4944 546 29 81 16 23.3666 
507 25 70 49 22.5167 547 29 92 09 23.3880 
508 25 80 64 22.5389 548 30 03 04 23.4094 
509 25 90 81 22.5610 549 30 14 01 23.4307 
510 26 01 00 22.5832 550 30 25 00 23.4521 
511 26 11 21 22.6053 551 30 36 01 23.4734 
512 26 21 44 22.6274 552 30 47 04 23.4947 
513 26 31 69 22.6495 553 30 58 09 23.5160 
514 26 41 96 22.6716 554 30 69 16 23.5372 
515 26 52 25 22.6936 555 30 80 25 23.5584 
516 26 62 56 22.7156 556 30 91 36 23.5797 
517 26 72 89 22.7376 557 31 02 49 23.6008 
518 26 83 24 22.7596 558 31 13 64 23.6220 
519 26 93 61 22.7816 559 31 24 81 23.6432 
520 27 04 00 22.8035 560 31 36 00 23.6643 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
561 3147 21 23.6854 601 361201 24.5153 
562 315844 23.7065 602 36 24 04 24.5357 
563 3169 69 23.7276 603 36 36 09 24.5561 
564 31 80 96 23.7487 604 3648 16 24.5764 
565 319225 23.7697 605 36 60 25 24.5967 
566 3203 56 23.7908 606 36 72 36 24.6171 
567 3214 89 23.8118 607 36 84 49 24.6374 
568 322624 23.8328 608 36 96 64 24.6577 
569 3237 61 23.8537 609 37 08 81 24.6779 
570 3249 00 23.8747 610 372100 24.6982 
571 32 60 41 23.8956 611 373321 24.7184 
572 32 71 84 23.9165 612 37 45 44 24.7386 
573 32 83 29 23.9374 613 37 57 69 24.7588 
574 3294 76 23.9583 614 37 69 96 24.7790 
575 33 06 25 23.9792 615 37 82 25 24.7992 
576 33 17 76 24.0000 616 37 94 56 24.8193 
577 33 29 29 24.0208 617 38 06 89 24.8395 
578 33 40 84 24.0416 618 38 19 24 24.8596 
579 33 52 41 24.0624 619 38 31 61 24.8797 
580 33 64 00 24.0832 620 38 44 00 24.8998 
581 33 75 61 24.1039 621 38 56 41 24.9199 
582 33 87 24 24.1247 622 38 68 84 24.9399 
583 33 98 89 24.1454 623 38 81 29 24.9600 
584 34 10 56 24.1661 624 38 93 76 24.9800 
585 342225 24.1868 625 39 06 25 25.0000 
586 34 33 96 24.2074 626 39 18 76 25.0200 
587 3445 69 24.2281 627 39 31 29 25.0400 
588 3457 44 24.2487 628 39 43 84 25.0599 
589 346921 24.2693 629 39 56 41 25.0799 
590 34 81 00 24.2899 630 39 69 00 25.0998 
591 34 92 81 24.3105 631 39 81 61 25.1197 
592 3504 64 24.3311 632 39 94 24 25.1396 
593 351649 24.3516 633 40 06 89 25.1595 
594 35 28 36 24.3721 634 40 19 56 25.1794 
595 35 40 25 24.3926 635 40 32 25 25.1992 
596 35 52 16 24.4131 636 40 44 96 25.2190 
597 35 64 09 24.4336 637 40 57 69 25.2389 
598 35 7604 24.4540 638 40 70 44 25.2587 
599 35 8801 24.4745 639 40 83 21 25.2784 
600 36 00 00 24.4949 640 40 96 00 25.2982 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
641 41 08 81 25.3180 681 46 37 61 26.0960 
642 41 21 64 25.3377 682 46 51 24 26.1151 
643 41 34 49 25.3574 683 46 64 89 26.1343 
644 41 47 36 25.3772 684 46 78 56 26.1534 
645 41 60 25 25.3969 685 46 92 25 26.1725 
646 41 73 16 25.4165 686 47 05 96 26.1916 
647 41 86 09 25.4362 687 47 19 69 26.2107 
648 41 99 04 25.4558 688 47 33 44 26.2298 
649 42 12 01 25.4755 689 47 47 21 26.2488 
650 42 25 00 25.4951 690 47 61 00 26.2679 
651 42 38 01 25.5147 691 47 74 81 26.2869 
652 42 51 04 25.5343 692 47 88 64 26.3059 
653 42 64 09 25.5539 693 48 02 49 26.3249 
654 42 77 16 25.5734 694 48 16 36 26.3439 
655 42 90 25 25.5930 695 48 30 25 26.3629 
656 43 03 36 25.6125 696 48 44 16 26.3818 
657 43 16 49 25.6320 697 48 58 09 26.4008 
658 43 29 64 25.6515 698 48 72 04 26.4197 
659 43 42 81 25.6710 699 48 86 01 26.4386 
660 43 56 00 25.6905 700 49 00 00 26.4575 
661 43 69 21 25.7099 701 49 14 01 26.4764 
662 43 82 44 25.7294 702 49 28 04 26.4953 
663 43 95 69 25.7488 703 49 42 09 26.5141 
664 44 08 96 25.7682 704 49 56 16 26.5330 
665 44 22 25 25.7876 705 49 70 25 26.5518 
666 44 35 56 25.8070 706 49 84 36 26.5707 
667 44 48 89 25.8263 707 49 98 49 26.5895 
668 44 62 24 25.8457 708 50 12 64 26.6083 
669 44 75 61 25.8650 709 50 26 81 26.6271 
670 44 89 00 25.8844 710 50 41 00 26.6458 
671 45 02 41 25.9037 711 50 55 21 26.6646 
672 45 15 84 25.9230 712 50 69 44 26.6833 
673 45 29 29 25.9422 713 50 83 69 26.7021 
674 45 42 76 25.9615 714 50 97 96 26.7208 
675 45 56 25 25.9808 715 51 12 25 26.7395 
676 45 69 76 26.0000 716 51 26 56 26.7582 
677 45 83 29 26.0192 717 51 40 89 26.7769 
678 45 96 84 26.0384 718 51 55 24 26.7955 
679 46 10 41 26.0576 719 51 69 61 26.8142 
680 46 24 00 26.0768 720 51 84 00 26.8328 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
721 51 98 41 26.8514 761 57 91 21 27.5862 
722 52 12 84 26.8701 762 58 06 44 27.6043 
723 52 27 29 26.8887 763 58 21 69 27.6225 
724 52 41 76 26.9072 764 58 36 96 27.6405 
725 52 56 25 26.9258 765 58 52 25 27.6586 
726 52 70 76 26.9444 766 58 67 56 27.6767 
727 52 85 29 26.9629 767 58 82 89 27.6948 
728 52 99 84 26.9815 768 58 98 24 27.7128 
729 53 14 41 27.0000 769 59 13 61 27.7308 
730 53 29 00 27.0185 770 59 29 00 27.7489 
731 53 43 61 27.0370 771 59 44 41 27.7669 
732 53 58 24 27.0555 772 59 59 84 27.7849 
733 53 72 89 27.0740 773 59 75 29 27.8029 
734 53 87 56 27.0924 774 59 90 76 27.8209 
735 54 02 25 27.1109 775 60 06 25 27.8388 
736 54 16 96 27.1293 776 60 21 76 27.8568 
737 54 31 69 27.1477 777 60 37 29 27.8747 
738 54 46 44 27.1662 778 60 52 84 27.8927 
739 54 61 21 27.1846 779 60 68 41 27.9106 
740 54 76 00 27.2029 780 60 84 00 27.9285 
741 54 90 81 27.2213 781 60 99 61 27.9464 
742 55 05 64 27.2397 782 61 15 24 27.9643 
743 55 20 49 27.2580 783 61 30 89 27.9821 
744 55 35 36 27.2764 784 61 46 56 28.0000 
745 55 50 25 27.2947 785 61 62 25 28.0179 
746 55 65 16 27.3130 786 61 77 96 28.0357 
747 55 80 09 27.3313 787 61 93 69 28.0535 
748 55 95 04 27.3496 788 62 09 44 28.0713 
749 56 10 01 27.3679 789 62 25 21 28.0891 
750 56 25 00 27.3861 790 62 41 00 28.1069 
751 56 40 01 27.4044 791 62 56 81 28.1247 
752 56 55 04 27.4226 792 62 72 64 28.1425 
753 56 70 09 27.4408 793 62 88 49 28.1603 
754 56 85 16 27.4591 794 63 04 36 28.1780 
755 57 00 25 27.4773 795 63 20 25 28.1957 
756 57 15 36 27.4955 796 63 36 16 28.2135 
757 57 30 49 27.5136 797 63 52 09 28.2312 
758 57 45 64 27.5318 798 63 68 04 28.2489 
759 57 60 81 27.5500 799 63 84 01 28.2666 
760 57 76 00 27.5681 800 64 00 00 28.2843 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
801 64 16 01 28.3019 841 70 72 81 29.0000 
802 64 32 04 28.3196 842 70 89 64 29.0172 
803 64 48 09 28.3373 843 71 06 49 29.0345 
804 64 64 16 28.3549 844 71 23 36 29.0517 
805 64 80 25 28.3725 845 71 40 25 29.0689 
806 64 96 36 28.3901 846 71 57 16 29.0861 
807 65 12 49 28.4077 847 71 74 09 29.1033 
808 65 28 64 28.4253 848 71 91 04 29.1204 
809 65 44 81 28.4429 849 72 08 01 29.1376 
810 65 61 00 28.4605 850 72 25 00 29.1548 
811 65 77 21 28.4781 851 72 42 01 29.1719 
812 65 93 44 28.4956 852 72 59 04 29.1890 
813 66 09 69 28.5132 853 72 76 09 29.2062 
814 66 25 96 28.5307 854 72 93 16 29.2233 
815 66 42 25 28.5482 855 73 10 25 29.2404 
816 66 58 56 28.5657 856 73 27 36 29.2575 
817 66 74 89 28.5832 857 73 44 49 29.2746 
818 66 91 24 28.6007 858 73 61 64 29.2916 
819 67 07 61 28.6182 859 73 78 81 29.3087 
820 67 24 00 28.6356 860 73 96 00 29.3258 
821 67 40 41 28.6531 861 74 13 21 29.3428 
822 67 56 84 28.6705 862 74 30 44 29.3598 
823 67 73 29 28.6880 863 74 47 69 29.3769 
824 67 89 76 28.7054 864 74 64 96 29.3939 
825 68 06 25 28.7228 865 74 82 25 29.4109 
826 68 22 76 28.7402 866 74 99 56 29.4279 
827 68 39 29 28.7576 867 75 16 89 29.4449 
828 68 55 84 28.7750 868 75 34 24 29.4618 
829 68 72 41 28.7924 869 75 51 61 29.4788 
830 68 89 00 28.8097 870 75 69 00 29.4958 
831 69 05 61 28.8271 871 75 86 41 29.5127 
832 69 22 24 28.8444 872 76 03 84 29.5296 
833 69 38 89 28.8617 873 76 21 29 29.5466 
834 69 55 56 28.8791 874 76 38 76 29.5635 
835 69 72 25 28.8964 875 76 56 25 29.5804 
836 69 88 96 28.9137 876 76 73 76 29.5973 
837 70 05 69 28.9310 877 76 91 29 29.6142 
838 70 22 44 28.9482 878 77 08 84 29.6311 
839 70 39 21 28.9655 879 77 26 41 29.6479 
840 70 56 00 28.9828 880 77 44 00 29.6648 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
881 77 61 61 29.6816 921 84 82 41 30.3480 
882 77 79 24 29.6985 922 85 00 84 30.3645 
883 77 96 89 29.7153 923 85 19 29 30.3809 
884 78 14 56 29.7321 924 85 37 76 30.3974 
885 783225 29.7489 925 85 56 25 30.4138 
886 78 49 96 29.7658 926 85 74 76 30.4302 
887 78 67 69 29.7825 927 85 93 29 30.4467 
888 78 85 44 29.7993 928 86 11 84 30.4631 
889 79 03 21 29.8161 929 86 30 41 30.4795 
890 79 21 00 29.8329 930 86 49 00 30.4959 
891 79 38 81 29.8496 931 86 67 61 30.5123 
892 79 56 64 29.8664 932 86 86 24 30.5287 
893 79 74 49 29.8831 933 87 04 89 30.5450 
894 79 92 36 29.8998 934 87 23 56 30.5614 
895 80 10 25 29.9166 935 87 42 25 30.5778 
896 8028 16 29.9333 936 87 60 96 30.5941 
897 80 46 09 29.9500 937 87 79 69 30.6105 
898 80 64 04 29.9666 938 87 98 44 30.6268 
899 80 82 01 29.9833 939 881721 30.6431 
900 81 00 00 30.0000 940 88 36 00 30.6594 
901 811801 30.0167 941 88 54 81 30.6757 
902 813604 30.0333 942 88 73 64 30.6920 
903 81 54 09 30.0500 943 88 92 49 30.7083 
904 817216 30.0666 944 89 11 36 30.7246 
905 81 90 25 30.0832 945 89 30 25 30.7409 
906 82 08 36 30.0998 946 89 49 16 30.7571 
907 82 26 49 30.1164 947 89 68 09 30.7734 
908 82 44 64 30.1330 948 89 87 04 30.7896 
909 82 62 81 30.1496 949 90 06 01 30.8058 
910 82 81 00 30.1662 950 90 25 00 30.8221 
911 829921 30.1828 951 90 44 01 30.8383 
912 83 17 44 30.1993 952 90 63 04 30.8545 
913 83 35 69 30.2159 953 90 82 09 30.8707 
914 83 53 96 30.2324 954 9101 16 30.8869 
915 83 72 25 30.2490 955 912025 30.9031 
916 83 90 56 30.2655 956 91 39 36 30.9192 
917 84 08 89 30.2820 957 91 58 49 30.9354 
918 84 27 24 30.2985 958 9177 64 30.9515 
919 8445 61 30.3150 959 9196 81 30.9677 
920 84 64 00 30.3315 960 92 16 00 30.9839 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
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TABLE A. SQUARES AND SQUARE ROOTS OF NUMBERS FROM 1 TO 1,000* (Continued) 


Number Square Square root | Number Square Square root 
961 92 35 21 31.0000 981 96 23 61 31.3209 
962 92 54 44 31.0161 982 96 43 24 31.3369 
963 92 73 69 31.0322 983 96 62 89 31.3528 
964 92 92 96 31.0483 984 96 82 56 31.3688 
965 93 12 25 31.0644 985 97 02 25 31.3847 
966 93 3156 31.0805 986 97 2196 31.4006 
967 93 50 89 31.0966 987 97 41 69 31.4166 
968 93 70 24 31.1127 988 97 6144 31.4325 
969 93 89 61 31.1288 989 97 8121 31.4484 
970 94 09 00 31.1448 990 98 01 00 31.4643 
971 9428 41 31.1609 991 98 20 81 31.4802 
972 94 47 84 31.1769 992 98 40 64 31.4960 
973 94 67 29 31.1929 993 98 60 49 31.5119 
974 94 86 76 31.2090 994 98 80 36 31.5278 
975 95 06 25 31.2250 995 99 00 25 31.5436 
976 95 25 76 31.2410 996 99 20 16 31.5595 
977 95 45 29 31.2570 997 99 40 09 31.5753 
978 95 64 84 31.2730 998 99 60 04 31.5911 
979 95 84 41 31.2890 999 99 80 01 31.6070 
980 96 04 00 31.3050 | 1000 100 00 00 31.6228 


* From Sorenson. Statistics for Students of Psychology and Education. New York: McGraw-Hill, 
1936. 
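
Any entry of Table A can be recomputed directly, which gives a convenient check on a doubtful figure. The following is only an illustrative sketch in Python; the function name table_a_entry is an assumption of this example and does not come from the text.

```python
import math


def table_a_entry(n: int) -> tuple[int, int, float]:
    """Return (number, square, square root rounded to four decimal places)."""
    return n, n * n, round(math.sqrt(n), 4)


# Print a few rows in the column order of Table A: number, square, square root.
for n in (400, 512, 1000):
    number, square, root = table_a_entry(n)
    print(f"{number:>4} {square:>9} {root:.4f}")
```

The square is exact integer arithmetic; only the square root is rounded, to the four places carried in the table.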


The Use of Tables B and C 


Tables B and C assume a normal distribution whose standard deviation is equal to 
1.00 and whose total area (or N) also equals 1.00. Under these conditions, there are 
fixed mathematical relationships between values on the base line (as measured in 
σ units) and areas under the curve (A, B, and C) and also ordinate values (y). 

The use of Tables B and C is fully explained in Chap. 7. Figures A.1, A.2, B.1, and 
B.2 may help to relate the symbols to the normal curve. 

Table B is best used when we know a z and want to find a corresponding A, B, or C 
area, or the ordinate y. Table C is best used when we know any one of the areas 
A, B, or C and want to find the corresponding z or y. In case any one of these areas 
is known, it can be readily used to find a corresponding area by means of the following 
relationships. 


A = B - .50 
A = .50 - C        (A + C = .50) 
B = A + .50 
B = 1.00 - C       (B + C = 1.00) 
C = .50 - A 
C = 1.00 - B 
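
The relationships among A, B, and C, and the Table C type of lookup (area known, z and y wanted), can be sketched in a few lines of Python. This is a hedged illustration only: the function names complete_areas and z_and_y_for_B are hypothetical, and the standard-library class statistics.NormalDist stands in for the printed table.

```python
from statistics import NormalDist


def complete_areas(A=None, B=None, C=None):
    """Given any one of the areas A, B, or C, return all three."""
    if A is None:
        if B is not None:
            A = B - 0.50          # A = B - .50
        elif C is not None:
            A = 0.50 - C          # A = .50 - C
        else:
            raise ValueError("give one of A, B, or C")
    return {"A": A, "B": A + 0.50, "C": 0.50 - A}


def z_and_y_for_B(B):
    """Return the standard score z cutting off the larger area B, and the ordinate y at z."""
    unit_normal = NormalDist()    # mean 0, standard deviation 1.00
    z = unit_normal.inv_cdf(B)
    return z, unit_normal.pdf(z)


areas = complete_areas(C=0.025)
z, y = z_and_y_for_B(areas["B"])
print(areas, round(z, 4), round(y, 4))
```

For a smaller area C of .025 the sketch returns B = .975 and a z of about 1.96, which agrees with the corresponding line of Table C.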
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[Figure: normal distribution curve; each half is labeled "(0.5000 of the total area)."] 
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TABLE B. PROPORTIONS OF THE AREA UNDER THE NORMAL DISTRIBUTION CURVE 
AND ORDINATES CORRESPONDING TO GIVEN STANDARD SCORES 


z A B C y 
Standard Area from Area in Area in Ordinate 
score (x/σ) | mean to x/σ | larger portion | smaller portion | at x/σ 
0.00 .0000 .5000 .5000 .3989 
0.05 .0199 .5199 .4801 .3984 
0.10 .0398 .5398 .4602 .3970 
0.15 .0596 .5596 .4404 .3945 
0.20 .0793 .5793 .4207 .3910 
0.25 .0987 .5987 .4013 .3867 
0.30 .1179 .6179 .3821 .3814 
0.35 .1368 .6368 .3632 .3752 
0.40 .1554 .6554 .3446 .3683 
0.45 .1736 .6736 .3264 .3605 
0.50 .1915 .6915 .3085 .3521 
0.55 .2088 .7088 .2912 .3429 
0.60 .2257 .7257 .2743 .3332 
0.65 .2422 .7422 .2578 .3230 
0.70 .2580 .7580 .2420 .3123 
0.75 .2734 .7734 .2266 .3011 
0.80 .2881 .7881 .2119 .2897 
0.85 .3023 .8023 .1977 .2780 
0.90 .3159 .8159 .1841 .2661 
0.95 .3289 .8289 .1711 .2541 
1.00 .3413 .8413 .1587 .2420 
1.05 .3531 .8531 .1469 .2299 
1.10 .3643 .8643 .1357 .2179 
1.15 .3749 .8749 .1251 .2059 
1.20 .3849 .8849 .1151 .1942 
1.25 .3944 .8944 .1056 .1826 
1.30 .4032 .9032 .0968 .1714 
1.35 .4115 .9115 .0885 .1604 
1.40 .4192 .9192 .0808 .1497 
1.45 .4265 .9265 .0735 .1394 
1.50 .4332 .9332 .0668 .1295 
1.55 .4394 .9394 .0606 .1200 
1.60 .4452 .9452 .0548 .1109 
1.65 .4505 .9505 .0495 .1023 
1.70 .4554 .9554 .0446 .0940 
1.75 .4599 .9599 .0401 .0863 
1.80 .4641 .9641 .0359 .0790 
1.85 .4678 .9678 .0322 .0721 
1.90 .4713 .9713 .0287 .0656 
1.95 .4744 .9744 .0256 .0596 
2.00 .4772 .9772 .0228 .0540 
2.05 .4798 .9798 .0202 .0488 
2.10 .4821 .9821 .0179 .0440 
2.15 .4842 .9842 .0158 .0396 
2.20 .4861 .9861 .0139 .0355 
2.25 .4878 .9878 .0122 .0317 
2.30 .4893 .9893 .0107 .0283 
2.35 .4906 .9906 .0094 .0252 
2.40 .4918 .9918 .0082 .0224 
2.45 .4929 .9929 .0071 .0198 
2.50 .4938 .9938 .0062 .0175 
2.55 .4946 .9946 .0054 .0154 
2.60 .4953 .9953 .0047 .0136 
2.65 .4960 .9960 .0040 .0119 
2.70 .4965 .9965 .0035 .0104 
2.80 .4974 .9974 .0026 .0079 
2.90 .4981 .9981 .0019 .0060 
3.00 .49865 .99865 .00135 .0044 
3.10 .49903 .99903 .00097 .0033 
3.20 .49931 .99931 .00069 .0024 
3.40 .49966 .99966 .00034 .0012 
3.60 .49984 .99984 .00016 .00061 
3.80 .499928 .999928 .000072 .00029 
4.00 .4999683 .9999683 .0000317 .00013 
4.50 .4999966 .9999966 .0000034 .000015 
5.00 .49999971 .99999971 .00000029 .0000015 
6.00 .499999999 .999999999 .000000001 .000000006 
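
A line of Table B can be reproduced for any z, including values that fall between the tabled steps. The sketch below is an illustration rather than the method used to prepare the table; the function name table_b_row is hypothetical, and the areas are obtained from the error function in Python's math module.

```python
import math


def table_b_row(z: float) -> dict:
    """Return the areas A, B, C and the ordinate y of the unit normal curve at z (z >= 0)."""
    A = 0.5 * math.erf(z / math.sqrt(2.0))               # area from the mean to z
    y = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)  # height of the curve at z
    return {"A": A, "B": 0.5 + A, "C": 0.5 - A, "y": y}


row = table_b_row(1.00)
print({name: round(value, 4) for name, value in row.items()})
```

For z = 1.00 the rounded values agree with the tabled line: .3413, .8413, .1587, and .2420.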
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TABLE C. STANDARD SCORES (OR DEVIATES) AND ORDINATES CORRESPONDING TO 
DIVISIONS OF THE AREA UNDER THE NORMAL CURVE INTO A LARGER PROPORTION 
(B) AND A SMALLER PROPORTION (C); ALSO THE VALUE √BC 

TABLE C. STANDARD SCORES (OR DEVIATES) AND ORDINATES CORRESPONDING TO 
DIVISIONS OF THE AREA UNDER THE NORMAL CURVE INTO A LARGER PROPORTION 
(B) AND A SMALLER PROPORTION (C); ALSO THE VALUE √BC (Continued) 


B        z         C 
.950               .050 
.955               .045 
.960               .040 
.965               .035 
.970               .030 
.975     1.9600    .025 
.980     2.0537    .020 
.985     2.1701    .015 
.990     2.3263    .010 
.995     2.5758    .005 
.996     2.6521    .004 
.997     2.7478    .003 
.998     2.8782    .002 
.999     3.0902    .001 
.9995    3.2905    .0005 
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TABLE D. COEFFICIENTS OF CORRELATION AND t RATIOS SIGNIFICANT AT THE .05 
LEVEL (ROMAN TYPE) AND AT THE .01 LEVEL (BOLD-FACED TYPE) 
FOR VARYING DEGREES OF FREEDOM* 


Number of variables 


E 


1.000] 1.000] 1.000| 1.000 
1.000] 1.000] 1,000) 1. 


+994! 
„999 


«979 
„993 


961 
2984) 


: 88 83 80 


23 
x 
e 


Pew an so on BS 
in 2 
Ч z 
el 


* Adapted from Wallace, H. A., and Snedecor, G. W. Correlation and Machine Calculation, 
1931, by courtesy of the authors. 
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TABLE D. COEFFICIENTS OF CORRELATION AND t RATIOS SIGNIFICANT AT THE .05 
LEVEL (ROMAN TYPE) AND AT THE .01 LEVEL (BOLD-FACED TYPE) 
FOR VARYING DEGREES OF FREEDOM* (Continued) 
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df1 degrees of freedom (for greater variance) 
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TABLE F.* .05 (ROMAN TYPE) AND .01 (BOLD-FACED TYPE) POINTS FOR THE DISTRIBUTION OF F 


6.69/15.98/15.52/15.31| 


ag Б Б ЗЕ 


6.61] 5.79) 5.41] 5. 


198.49199.01/99.17|99.25, 
16.26/13.27/12.06/11. 


|21.20/18.00/1. 


о о 29 
Д 
ne $$ Z9 8 


4 “g 9 Фо жа 


4 | 7.71| 6.94 6.59| 6.39] 6.26, 


2 |18.51/19.00/19.16/19.25. 
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OF THE UNIT NORMAL DISTRIBUTION CURVE (Continued) 
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TABLE H. CONVERSION OF A PEARSON r INTO A CORRESPONDING 
FISHER'S z COEFFICIENT* 


2 SS. -85 1,26 | .950 1.83 

|a 56. .86 1.29 .955 1.89 

45 57 87 1.33 900 1.95 

46 .58 88 1.38 | .965 2.01 

47 59 .89 1.42 .970 2.09 
30, 31 AB 60 90 1.47 | 7s 2.18 
юм 32 50 ól .905 1.50 | .980 2,30 
32 38 51 62 910 1,53 | 985 2.44 
33 34 52 63 915 1.56 | .990 2.65. 
4 385 54 64 .920 1.59 | .995 2,99 

.55 65 928 1.62 

50 ‚6 930 1.66 

.58 67 938 1.70 

59 ‚68 940 1.74 

.60 69 1.78 


22 
* The values in this table were derived by interpolation from Ta 
Method for Research Workers and are published by permission of ti 
Edinburgh and London, 
† For all values of r below .25, r = z. 
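
The conversion tabled here is Fisher's transformation, z = ½ log_e [(1 + r)/(1 - r)], which is the inverse hyperbolic tangent of r. A minimal sketch in Python follows; the function names fisher_z and pearson_r are assumptions of this example only.

```python
import math


def fisher_z(r: float) -> float:
    """Convert a Pearson r (with -1 < r < 1) into Fisher's z coefficient."""
    return math.atanh(r)          # same as 0.5 * log((1 + r) / (1 - r))


def pearson_r(z: float) -> float:
    """Convert Fisher's z coefficient back into a Pearson r."""
    return math.tanh(z)


for r in (0.24, 0.85, 0.995):
    print(f"r = {r:.3f}   z = {fisher_z(r):.2f}")
```

Rounded to two places, the sketch gives 1.26 for r = .85 and 2.99 for r = .995, as in the table, and it shows why r and z agree to two places when r is below .25.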
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TABLE J. TRIGONOMETRIC FUNCTIONS 


88 


Li 
88 


2 88888 888 


BO BO E 


S892 38232 844343 32338 22235 33833 55595 337437 955555 


* From Smail, College Algebra. New York: McGraw-Hill, 1931. 
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TABLE K. FOUR-PLACE LOGARITHMS OF NUMBERS* 


47 | 6721| 6730| 6730] 
4 68: (8 


п|[1|а4|]3[4[5%|[в|т1]]|,», 


* From Smail, College Algebra. New York: McGraw-Hill, 1931. 


one OMOEA neee. She, Oxo -n S- eee 


Prop. Parts 
T T 
2.2 |2.1 
44 |4.2 
0.6 .3 
8.8 4 
11.0 10,5 
13.2 12.9 
15.4 |14.7 
17.6 [168 
19.8 |18 
908 
10 | 38 
HH 
10:0 9.5 
12, 11.4 
M, 13.3 
16, 15.2 
18,0 117,1 
18 | 17 
1.8 ki 
8.0 н 
5.4 |5, 
1 
HRA 10.2 
12. 1.9 
ic] 13.3 
0.2 116, 
16 15 
1.6 [1.5 
3.2 |30 
48 [4.5 
0.4 | 6.0 
8.0 gj 
9.0 0 
14.4 11.4 
4 з 
в | 20 
a 9 
.6 
0 
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eee оооло Ske 


pumas Domnceuae одета 
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TABLE K. FOUR-PLACE LOGARITHMS OF NUMBERS* (Continued) 


OMNIS е козю 
SO OUO AS 
оол еза o 


Sees 
Sees, 
—— 


(5010 oui coto 
$oeesMeeO, 
toco lo ho ei d 


(00-10 ou coto 
Sers 
Sossen 


OMNA t 
роон 
DELI 


9628 
9675 
9722] 
9768) 
9814 
9859) 
9903] 9 
9948) 
9991 


0035 


> 


(00-10 e 
PISO ДОЦ 
S отоо. 


* From Smail, College Algebra. New York: McGraw-Hill, 1931. 
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TABLE L. VALUES OF RANK-DIFFERENCE COEFFICIENTS OF CORRELATION THAT 
ARE SIGNIFICANT AT THE .05 AND .01 LEVELS (ONE-TAIL TEST)* 


* Reproduced by permission from Dixon, W. J., and Massey, F. J., Jr. Introduction to Statistical 
Analysis. New York: McGraw-Hill, 1951. Table 17-6, p. 261. This table had been derived from 
Olds, E. G. The 5 per cent significance levels of sums of squares of rank differences and a correction. 
Ann. math. Statist., 1949, 20, 117-118. For a two-tail test, double the probabilities to .10 and .02. 
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TABLE M. VALUES TO FACILITATE THE ESTIMATION OF THE COSINE-PI COEFFICIENT 
OF CORRELATION, WITH TWO-PLACE ACCURACY* 


ad/bc r_cos-pi | ad/bc r_cos-pi | ad/bc r_cos-pi | ad/bc r_cos-pi 
1.013 .005† 1.940 .255 4.067 .505 11.512 .755 
1.039 .015 1.993 .265 4.205 .515 12.177 .765 
1.066 .025 2.048 .275 4.351 .525 12.906 .775 
1.093 .035 2.105 .285 4.503 .535 13.702 .785 
1.122 .045 2.164 .295 4.662 .545 14.592 .795 
1.150 .055 2.225 .305 4.830 .555 15.573 .805 
1.180 .065 2.288 .315 5.007 .565 16.670 .815 
1.211 .075 2.353 .325 5.192 .575 17.900 .825 
1.242 .085 2.421 .335 5.388 .585 19.288 .835 
1.275 .095 2.490 .345 5.595 .595 20.866 .845 
1.308 .105 2.563 .355 5.813 .605 22.675 .855 
1.342 .115 2.638 .365 6.043 .615 24.768 .865 
1.377 .125 2.716 .375 6.288 .625 27.212 .875 
1.413 .135 2.797 .385 6.547 .635 30.106 .885 
1.450 .145 2.881 .395 6.822 .645 33.578 .895 
1.488 .155 2.968 .405 7.115 .655 37.818 .905 
1.528 .165 3.059 .415 7.428 .665 43.100 .915 
1.568 .175 3.153 .425 7.761 .675 49.851 .925 
1.610 .185 3.251 .435 8.117 .685 58.765 .935 
1.653 .195 3.353 .445 8.499 .695 71.046 .945 
1.697 .205 3.460 .455 8.910 .705 88.984 .955 
1.743 .215 3.571 .465 9.351 .715 117.52 .965 
1.790 .225 3.690 .475 9.828 .725 169.60 .975 
1.838 .235 3.808 .485 10.344 .735 293.28 .985 
1.888 .245 3.935 .495 10.908 .745 934.06 .995 


* Based upon a more detailed tabulation of the same values by Perry, N. C., Kettner, N. W., Hertzka, 
A. F., and Bouvier, E. A. Estimating the tetrachoric correlation coefficient via a cosine-pi table. 
Technical Memorandum No. 2. Los Angeles: University of Southern California, 1953. 

† Example: If an obtained ratio ad/bc equals 3.472, we find that this value lies between tabled 
values of 3.460 and 3.571. The cosine-pi coefficient is therefore between .455 and .465; that is to say, 
it is .46. If bc is greater than ad, find the ratio bc/ad and attach a negative sign to r_cos-pi. 
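
The tabled ratios are consistent with the usual cosine-pi formula, r = cos [π/(1 + √(ad/bc))], so the coefficient can also be computed without interpolation. The sketch below assumes that formula; the function names are hypothetical, and both cross products are assumed to be greater than zero.

```python
import math


def cosine_pi_from_ratio(ratio: float) -> float:
    """Cosine-pi coefficient for a ratio ad/bc that is 1.00 or greater."""
    return math.cos(math.pi / (1.0 + math.sqrt(ratio)))


def cosine_pi_r(ad: float, bc: float) -> float:
    """Cosine-pi coefficient from the two cross products of a fourfold table."""
    if ad <= 0 or bc <= 0:
        raise ValueError("both cross products must be greater than zero")
    if bc > ad:
        # Use the ratio bc/ad and attach a negative sign, as in the table footnote.
        return -cosine_pi_from_ratio(bc / ad)
    return cosine_pi_from_ratio(ad / bc)


print(round(cosine_pi_from_ratio(3.472), 2))
```

For the ratio 3.472 used in the footnote example, the computed value rounds to .46, the figure reached from the table.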
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TABLE N. CELL FREQUENCIES REQUIRED TO ACHIEVE SIGNIFICANT CHI SQUARES 
AT THE .05 POINT (ROMAN) AND AT THE .01 POINT (BOLD-FACED) WHEN EACH 
IS PARALLEL TO THE SMALLEST CELL FREQUENCY IN A FOURFOLD TABLE* 

Smallest Cell Frequency 


N: |0 


1 2 3 


e 


10 11 12 13 14|15 16 17 18 1920 21 22 23 24/25 


— 
e 
D ta ош 


К< ЖК Ud Unc 


И Cd Gn cd On 


0-10 -1 со 0 0 0 
«© оо Ф со «© оо бо-: 
коок | 

©ооосо| œ 


«© 0-00 
— 
© 
— 
= 
m 
— 


«оф e 
e 
© 


11 12 13 


10 


11 


11 12 
12 — 
11 12 
13 13 
12:12 13 
13 14 14 
12 13 14 
13 14 15 


12 13 14 15 
14 14 15 16 
12 13 14 15 
14 16 16 16 
12 13 14 15 16) 
14 15 16 17 17 
12 14 14 15 16) 
14 15 16 17 18 


Instructions: This table was designed for use in 
comparing two groups of equal size (N_i cases in 
each) with respect to their distributions in two 
categories in some other variable. For example, 
10 adult males and 10 adult females were asked 
whether they like to watch wrestling on television. 
Of the males, 8 said “Yes” and 2 “No”; of the 
females, 4 said “Yes” and 6 “No.” The smallest 
cell frequency is 2. Its parallel frequency is 6. In 
the row for N_i = 10 and the column for 2, we find 
that it requires frequencies of 8 and 9 to be 
significant at the .05 and .01 points, respectively. 
Since the obtained parallel frequency of 6 falls 
short of 8, the difference is insignificant. 
Interpolations may be made between neighboring 
rows where necessary. 
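
The example can be checked by computing chi square for the fourfold table directly. The sketch below is an assumption-laden illustration, not the procedure by which Table N was prepared: it uses chi square with Yates' correction for continuity and the 1-df critical values 3.841 and 6.635, and the function names are hypothetical.

```python
def corrected_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi square for the fourfold table [[a, b], [c, d]] with Yates' correction."""
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # an empty row or column gives no evidence of association
    deviation = max(0.0, abs(a * d - b * c) - n / 2.0)
    return n * deviation ** 2 / margins


def required_parallel_frequency(n_i: int, smallest: int, critical: float):
    """Lowest parallel frequency that reaches the critical chi square, or None."""
    for parallel in range(smallest, n_i + 1):
        chi = corrected_chi_square(n_i - smallest, smallest,
                                   n_i - parallel, parallel)
        if chi >= critical:
            return parallel
    return None


# Males: 8 "Yes", 2 "No"; females: 4 "Yes", 6 "No".
print(corrected_chi_square(8, 2, 4, 6))
print(required_parallel_frequency(10, 2, 3.841),
      required_parallel_frequency(10, 2, 6.635))
```

Under these assumptions the observed table gives a corrected chi square of 1.875, short of 3.841, and the required parallel frequencies come out as 8 and 9, in agreement with the tabled entries for this example. Agreement for other cells of the table is not guaranteed by this sketch.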


5 
" 
6 
8 
6 
6 
6 


8 


7 91011 
9 11 12 13 
8 9.11 12 
10 12 13 15 
8 91112 
10 12 14 16) 
8 10 11 33 
10 12 14 15 


13 14 15 16 16| 
16 16 16 17 18 
13 15 16 17 18 
16 17 18 19 20 
14 15 16 18 19 
17 18 19 20 22 
14 15 17 18 19) 
17 18 20 21 22 


17 

19 

19 20 21 22 2324 

21 22 23 24 2026 

20 21 22 23 24025 26 27 28 2930 

23 24 25 26 27/28 29 30 31 3232 

20 22 23 24 25/26 27 28 29 30/31 32 33 34 35/36 
24 25 26 27 28/29 30 31 32 33034 35 36 37 3839 


N: |0 


123 


8 


10 11 12 13 14ʃ15 16 17 18 19120 21 22 23 24125 


* Adapted by permission from Mainland, D., and Murray, I. M. Tables for use in fourfold 
contingency tables. Science, 1952, 116, 591-594. 
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TABLE O. FREQUENCIES IN BINOMIAL DISTRIBUTIONS DERIVED FROM EXPANSION 
OF THE EXPRESSION (½ + ½)^n, WHERE n VARIES FROM 1 THROUGH 20, AND THE 
SUMS OF FREQUENCIES, 2^n* 


* After n = 10 no distribution is complete, but since each is symmetrical it can be readily completed 
where necessary from the frequencies given. 
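
The frequencies of Table O are the binomial coefficients, so any row, complete or not, can be regenerated. The following Python sketch is offered only as an illustration; the function name binomial_frequencies is an assumption of this example.

```python
import math


def binomial_frequencies(n: int) -> list[int]:
    """Frequencies for 0, 1, ..., n successes in the expansion with exponent n."""
    return [math.comb(n, k) for k in range(n + 1)]


for n in (4, 12):
    row = binomial_frequencies(n)
    # Each row is symmetrical and its frequencies sum to 2 to the power n.
    print(n, row, sum(row))
```

Because each row reads the same in both directions, the second half of an incomplete row in the table is simply the first half reversed.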
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TABLE P. SIGNIFICANT T VALUES AT THE .05, .02, AND .01 LEVELS FOR DIFFERENT 
NUMBERS OF RANKED DIFFERENCES. T IS THE SMALLER SUM OF RANKS 
ASSOCIATED WITH DIFFERENCES ALL OF THE SAME SIGN* 
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TABLE Q. SIGNIFICANT R VALUES AT THE .05, .02, AND .01 LEVELS FOR DIFFERENT 
NUMBERS OF N_i CASES IN TWO SAMPLES OF EQUAL SIZE. R IS THE SMALLER 
SUM OF RANKS* 


N_i    P = .05    P = .02    P = .01 
 5       18         16         15 

 6       27 
 7       37 
 8       49 
 9       63 
10       79 

11       97 
12      116 
13      137 
14      160        152 
15      185        176 

16      212        202 
17      241        230 
18      271        259 
19      303        291        282 
20      338        324        315 


* Reproduced by permission from Wilcoxon, F. Some Rapid Approximate Statistical Procedures. 
Stamford, Conn.: American Cyanamid Co., 1949. 
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Page numbers in boldface type indicate bibliographical references 


Alienation, coefficient of, 375-377 
formula for, 375 
graphic chart for, 377 
table for, 376 


Analysis of variance, assumptions for, 281-283 
and correlation ratio, 293-294 
evaluation of, 281-283 
experimental design and, 283 
F ratio in, 261-262, 274, 280, 293 
in one-way classification, 258-267 
formulas for, 259-260, 265-267 
without replications, 278-280 
in two-way classification, 267-280 
formulas for, 271-273, 277-279 
Aristotle and classification, 12 
Arithmetic mean (see Mean, arithmetic) 
Army General Classification Test, data, 
97-99, 169 
Array defined, 150 
Attenuation, correction for, 476-478 
formulas for, 476-477 
limitations in, 478 
definition of, 476 
and factor theory, 476-477 
Attributes, prediction of, 381-382 
(See also Prediction) 
Average, definition of, 53 
running, 47 
(See also Mean) 
Average deviation, 78, 82-85 
interpretation of, 82-83 
relations to other statistics, 100 
use of, 99-100 


Baller, W. R., 229 
Bar diagram, 19-20 

for correlations, 148 

for distributions, 114-115 
Barlow's Tables, 10 


Bartlett, M. S., test of homogeneity of variances, 242-244, 258, 263 
Baxter, B., 283 
experimental design, 283 


Beall, G., 246 
combined statistical tests, 245 
Belt graph, 22-23 
Berkson, J., 387 
cost and utility, 387 
Beta coefficient, 394, 406-408 
Binomial distribution, in combined statistical tests, 245 
frequencies for, 552 
as statistical model, 205-206, 209 
Binomial expansion, 206-207, 213 
Biographical data, validity of, 484 
Biserial r, evaluation of, 300-301 
formulas for, 297, 299 
reliability of, 299-301 
Bouvier, E. A., 308, 550 
cosine-pi r, 308, 550 
Brogden, H. J., 432, 465 
personnel classification, 432 
reliability, 455 
Brown, C. W., 386 
selection chart, 386 


C scale, 501-503 
norms, 503-505 
Cantril, H., 234 
Categories, prediction of, 346-356 
qualitative and quantitative, 14 
Cattell, R. B., 464 
factor methods, 464 
Cell-square contingency, definition, 232 
in test of prediction, 337-338 
Centile, graphic estimate of, 110 
interpolated, 108 
Centile norms (see Norms) 
Centile point, 107 
integral, 111 
Centile ranks, 107 
spacing of, 112, 114 
Centiles corresponding to z scores, 130-131 
Central value defined, 53 
Change, statistical significance of, 198-200 
Cheshire, L., 308 
tetrachoric r, 308 
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Chi square, 228-247 
and coefficient of contingency, 305 
in combined statistical tests, 245 
in contingency tables, 229-239 
definition of, 228-229 
distribution, 233 
formulas for, 232, 236-237, 240, 245 
in median test, 250 
and phi coefficient, 312 
and t ratio, 234 
table of, 234, 540, 551 
as test, of independence, 229-232 
of normality, 123, 240-242 
of predictions, 338-340 
in two-cell table, 237 
in two-by-two table, 236 
Class interval, 35-38 
Classification, of data, 12-24 
of personnel, 431-433 
qualitative versus quantitative, 14 
Cobb, M. V., 310 
Cochran, W. G., 221, 282, 283 
analysis of variance, 282 
experimental design, 221 
t ratio, 221 
Code method, for computing mean, 56-58 
for computing standard deviation, 91-93 
for correlation, 141, 144 
Cohen, B. H., 245 
combined statistical tests, 245 
Coin-tossing data, 119-120 
Column diagram (histogram), 38-42 
Column-square contingency, 337 
Communality defined, 465 
Composite-rank test, 252-254 
Composite scores, 415-426 
correlations of, 425-426 
means of, 416—417 
reliability of, 472-473 
standard deviations of, 417-421 
Computation, rules for, 31-32 
Confidence interval, 166-168 
Confidence levels, 167 
Confidence limits, 166-168 
Confounding, 269 
Conrad, H. S., 479 
index of forecasting efficiency, 479 
Contingency, cell-square, 337-338 
coefficient of, 315-316 
formula for, 315 
as index of prediction, 337-338 
limits to, 316 
mean-square, 315 
Continuity, correction for, 234 
Cooperative Test Service, 114 
Correlation, of averages, 324-325 
biserial, 297-301 
(See also Biserial r) 
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Correlation, code method of, 141-144 
coefficient of, 4 
definition of, 135 
and factor loading, 467 
as index of prediction, 375-379 
interpretation of, 145-148 
multiple, 397, 410 
origin of, 369-370 
standard error of, 178-180 
t ratio for, 219 
uses of, 145-146 
coefficients of, averages of, 325-326 
correlation between, 193-194 
differences between, significance of, 
193-194 
relativity of, 147, 178-183, 318 
of composites, 425-426 
computation of, 138-143 
corrected, for coarse grouping, 329-330 
for restriction of range, 320-321 
cosine-pi, 307-308 
definition of, 135-136 
diagram, 362-363 
evaluation of, 317-318 
graphic representation of, 148-149 
in heterogeneous samples, 322-325 
as index of goodness of fit, 295 
intraclass, 280-281 
item-test, 453-454 
of means, 187 
multiple, corrected for bias, 399 
and factor theory, 469-470 
formulas for, 392-393, 397, 409 
point-biserial, 433 
principles of, 401-404 
shrinkage in, 399, 412 
in small samples, 398-399 
negative, 137, 139-140 
part-remainder, 327 
part-whole, 326-327 
partial, 316-318 
standard error of, 318 
phi coefficient, 311-315 
(See also Phi coefficient) 
point-biserial, 301-305 
(See also Point-biserial r) 
product-moment, assumptions under- 
lying, 149-150 
formulas, 138-143 
between proportions, 191 
rank-difference (see Rank-difference cor- 
relation) 
in restricted range, 320-321 
serial, 301 
between and within subsamples, 324-325 
of sums, 425-426 
proof for, 515-516 
tetrachoric (see Tetrachoric r) 


Correlation, variability and, 318-322 
Correlation index, 328 
Correlation ratio, 150, 288-297 
and analysis of variance, 293-294 
evaluation of, 294-297 
formulas for, 290 
Cosine-pi coefficient, formulas for, 307 
Cost and utility in personnel selection, 387 
Covariance, definition of, 417 
of items, 449-450 
and test reliability, 450 
Cox, G. M., 221, 283 
experimental design, 283 
t ratio, 221 
Critical ratio defined, 185 
Critical-score point, 341-356 
formulas for, 346-347, 353 
for genuine dichotomy, 350-356 
graphic determination of, 343-346, 352 
Cronbach, L. J., 442 
reliability, 442 
Cureton, E. E., 246 
chi square, 246 
Cutoff, double, 387-388 
Cutoff method, evaluation of, 428-429 
multiple, 426-429 
and regression method, 427-428 
variations of, 429 
Dailey, J. T., 472 
reliability, 472 
Darwin, C., regression phenomena, 368 
and statistics, 1 
Data, categorical, 12-13 
enumeration, 11-24 
enumerative, 333 
graphic representation of, 19-24 
kinds of, 11 
metric, 11, 24-33 
and statistics, 11-12 
tabulation of, 18-19 
Decibel scale, 73 
Decile, 107 
Decile scale, 110-111 
Degrees of freedom, in analysis of variance, 
259-260, 273-274 
for chi square, 232-233, 241, 243, 245 
in contingency tables, 232 
for correlation coefficient, 181 
partial, 318 
multiple, 399 
definition of, 163 
for F test, 224-225, 259-260, 273-274, 
294, 400 
for standard deviation, 163-164 
for t test, 218, 220-221 
Deming, W. E., curve fitting, 297 
Determination, coefficient of, 375-379, 446 
diagram for, 377 
multiple, 397 
table for, 376 
Deviation, definition of, 66 
formula for, 82 
root-mean-square, 85 
statistical significance of, 213 
Difference, composite-rank test of, 252-254 
between correlated proportions, signifi- 
cance of, 222-223 
between correlation coefficients, signifi- 
cance of, 193-194 
between frequencies, significance of, 
190-192 
between means, significance of, 185-189 
t ratio for, 220 
median test of, 249-251 
between percentages, significance of, 190- 
192 
between proportions, chi-square test of, 
239-240 
significance of, 190-192 
t ratio for, 221-222 
sign-rank test of, 251-252 
sign test of, 248-249 
between standard deviations, signifi- 
cance of, 192, 224-225 
Discontinuity, correction for, 212 
test for, 222 
Discriminant function, 432-433 
Distribution, bimodal, 451 
cumulative, 104-110 
and item intercorrelation, 451-452 
mesokurtic, 217 
normalized, 499 
platykurtic, 452 
rectangular, 111, 451 
sampling, definition of, 160 
Doolittle method, 406-408 

Dressel, P. L., 456 
reliability, 456 

DuBois, P. H., 28 

Dunlap, J. W., 10, 299 
biserial r, 299 


Edwards, A. L., 283 
analysis of variance, 283 

Enumerative data, 333 
(See also Data) 

Error, score component, 436 
term, in F ratio, 262-263 
Errors, of prediction, measurement of, 359-360 
in statistical inference, 215-217 
Types I and II, 216 
Eta coefficient, 290 
(See also Correlation ratio) 
Experiment, exploratory, 203 
group matching in, 198-199 
hypotheses in, 203 
Experimental design and analysis of vari- 
ance, 283 
Extrasensory perception, 204-205 


F distribution, 224-225 
F ratio, in analysis of variance, 261-262, 
274, 280, 293 
definition of, 224 
relation to t ratio, 264 
F test, for difference between multiple R's, 
400 


following Bartlett's test, 244 
Factor, loading, definition of, 466-467 
as validity coefficient, 462 
saturation, 467 
Factor theory, 464-470 
of attenuation, 476-477 
basic assumptions, 464-465 
theorems, 464-467 
Factors, weighting of, 469-470 
Fechner’s law, 73 
Festinger, L., 221 
t test, 221 
Fisher, R. A., 540, 545 
discriminant function, 432 
t formulas, 219-221 
t ratio, 217 
table of chi square, 540 
z coefficient, 182-183 
in averaging r’s, 325-326 
table of, 545 
Flanagan, J. C., 489 
score scales, 489 
Forecasting efficiency, index of, 375-378 
diagram for, 378 
formulas for, 377, 479 
in multiple prediction, 398, 410 
table for, 376 
for true criterion, 478-479 
in predicting attributes, 335-336 
Frequencies, differences between, signifi- 
cance of, 190-192 
obtained versus expected, 123 
Frequency, cell, 142 
cumulative, 104 
definition of, 14 
expected, computation of, 231 
Frequency, reliability of, 178 
Frequency distribution, 35 
graphic representation of, 38-43 
Frequency polygon, 38-41 
Fruchter, B., 464, 483 
factor methods, 464 
wrongs scores, 483 
Function fluctuation, of individuals, 448 
of tests, 448 


Gallup, G., public-opinion poll, 213 
Galton, F., correlation coefficient, origin of, 
369-370 
regression phenomenon, 368-369 
and statistics, 1 
Gauss, C. F., normal curve, 116-120 
Geometric mean, 53, 72-74 
formulas for, 72 
Ghiselli, E. E., 386 
selection chart, 386 
Goodfellow, L. D., 210 
human probability, 210 
Goodness of fit, correlation and, 295 
Gordon, M. H., 246 
chi square, 246 
Gosset, W. S., t ratio, 217 
Grant, D. A., 209 
statistical tests, 209 
Graphic methods, 19-24, 38-48, 114, 148- 
149, 343-346, 352, 396, 497-498 
Grouping, coarse, 49-50 
error of, 95-97 
and correlation, 329-330 
and mode, 63 
and standard deviation, 95-97 
Guilford, J. P., 10, 74, 146, 280, 309, 327, 
331, 346, 384, 388, 405, 412, 445, 457, 
462, 475, 483 
analysis of ratings, 280 
Aptitude Survey, 152, 503-504 
biographical data, 484 
factor analysis, 412 
factor theory, 462 
factors, 462 
in wrongs scores, 483 
item-total correlation, 327 
item weights, 483 
point-biserial r, 331 
prediction of categories, 346, 353 
r estimated from phi, 331 
reliability, 445, 457 
selection by tests, 384-385 
test evaluation, 146, 406 
tetrachoric r, 309 
validation, 388 


Guilford, R. B., 462, 476 
factors, 462 
Guilford-Zimmerman Aptitude Survey, 
data, 152 
profile chart, 504 
Guttman, L., prediction, 340-341 


Harmonic mean, 53, 74 
formula for, 74 
Harrell, M. S., 97, 169 
Army General Classification Test data, 
97, 169 
Harrell, T. W., 97, 169 
Army General Classification Test data, 
97, 169 
Hartson, L. D., 392 
Hayes, S. P., 308, 309 
tetrachoric r, 308-309 
Henry, F. M., 173 
cluster sampling, 173 
Hertzka, A. F., 308, 550 
cosine-pi r, 308, 550 
Histogram, 38-42 
Homoscedasticity defined, 150 
Horst, P., 433 
differential prediction, 433 
Hoyt, C. J., 455 
reliability formula, 455 
Hull, C. L., 146, 451 
minimum useful validity, 146 
suggestibility measures, 451 
Hypothesis, concerning population mean, 
165-167 
test of, 203-217 
(See also Null hypothesis) 


Independence of observations, 172-173 

Index correlation, 328-329 

Index number defined, 17 

Interaction variance defined, 269 

Internal consistency (see Reliability) 

Interquartile range, 80 

IQ's, correlation of, 328-329 

Item, difficulty, and variance, 450 
discrimination value of, 473-474 
intercorrelation of, estimated, 454 

and range of difficulty, 450-451 

precision of, 474 
weights, 483-484 

Item-test correlation, 453-454 


Jarrett, R. F., 173, 386 
cluster sampling, 173 
selection index, 385-386 
Jaspen, N., 301 
serial correlation, 301 
Jenness, A. F., 366 
Johnson, P. O., 283 
analysis of variance, 283 
Jorgensen, A. P., 366 


Kelley, T. L., 10, 50, 146, 316, 412 
coefficient of contingency, 316 
minimum acceptable reliability, 146 
number of classes, 50 
regression weights, 412, 415 

Kendall, M. G., 288 
rank correlation, 288 
tau test, 254 

Kettner, N. W., 550 
cos-pi coefficient, 550 

Kogan, L. S., 283 
experimental design, 283 

Kreuter, R. P., profile method, 430 

Kuder, G. F., 454 
reliability, 454-456 

Kuder-Richardson formulas, 454-456 

Kurtosis, 217-218 

Kurtz, A. K., 10 


Least squares, principle of, 358 
Leptokurtic distribution, 218 
Lev, J., 217, 235, 303 
exact probabilities, 235 
nonparametric tests, 254 
point-biserial r, 303 
statistical inference, 217 
Student's t distribution, 246 
Lewis, D., 219, 226, 233, 297 
chi-square distribution, 233 
curve fitting, 297 
F distribution, 225 
t distribution, 219 
Lindquist, E. F., 157, 283, 323, 489 
analysis of variance, 283 
random numbers, 157 
“within” correlation, 323, 324 
Linear transformation, equation for, 493 
proof for, 517 
Linearity, F test for, 294 
Literary Digest poll, 213-214 
Loveland, E. H., 246 
chi square, 246 
Lyon, T. C., 309 
tetrachoric r, 309 


McNemar, Q., 216, 222, 240, 313, 488 
chi-square test, 240 
difference between proportions, 222 
McNemar, Q., hypothesis testing, 216 
phi coefficient, 313 
test scales, 488 
Mainland, D., 551 
chi-square table, 551 
Mann, H. B., 254 
Mann-Whitney U test, 253-254 
Manson, M. P., 75, 350 
alcoholic data, 75, 350 
Marks, E. S., 173, 488 
cluster sampling, 173 
test scales, 488 
Martin, G. B., 479 
index of forecasting efficiency, 479 
Massey, F. J., Jr., 549 
table of significant rho's, 549 
Matched samples, 190 
sampling statistics in, 195-199 
Mathematical functions, use of, 297 
Mathematics role in science, 203-204 
Maximum likelihood, principle of, 335 
Mean, arithmetic, 4, 53 
advantages of, 64-65 
formulas for, 54-58 
properties of, 65 
in skewed distributions, 67 
use of, 68-69 
of arithmetic means, 69-70 
of composites, 416-417 
desired, achievement of, 421-422 
geometric, 53, 72-74 
harmonic, 53, 74 
of linear function, 508 
of percentages, 71-72 
formula for, 71 
population, 165-167 
proofs regarding, 507-508 
of proportions, 71-72 
of sums, 416-417 
proofs of, 513-514 
of test item, 449 
weighted, 70-72 
of correlation coefficients, 326 
formulas for, 70, 221 
of proportions, 221 
Mean square defined, 259 
Means, of columns and rows, 363-364 
difference between, significance of, 185- 
189, 220-221 
Measurement, 24-29 
educational, 25-26 
psychological, 24-28 
rank-order, 26 
Median, 4, 53, 58-63 
formulas for, 60 
graphic estimation of, 105 
properties of, 65-66 
reliability of, 173 
Median, in skewed distributions, 67 
use of, 68-69 
Median test, 249-251 
Mesokurtic distribution, 217 
Michael, W. B., 10, 303, 308, 331, 346, 384 
cosine-pi r, 308 
point-biserial r, 303, 331 
prediction of categories, 346, 353 
selection by tests, 384-385 
Midpoint, location of, 41 
Mode, 4, 53, 63-64 
formula for, 64 
in skewed distributions, 67 
use in prediction, 336 
Moment defined, 66 
Monotonic function defined, 294 
Moses, L. E., 254 
nonparametric tests, 254 
Mosier, C. I., 462, 473 
factors, 462 
reliability, 473 
Mueller, C. G., transformations, 248 
Murray, I. M., 551 
chi-square table, 551 


N required for significance, 213-215 
Nondetermination, coefficient of, 378 
Normal curve, as approximation to binomial 
distribution, 212-213 
areas under, 125-131 
points corresponding to, 129-131 
best-fitting, 121-122 
graphic, 123-124 
and coin tossing, 119-120 
equation, 120 
statistical constants of, 533-537, 543-544 
as statistical model, 211-213 
tables, 533-537 
Normal distribution, 6, 116-131 
assumptions of, 116-117 
chi-square test of, 240-242 
kurtosis of, 217-218 
and probability, 118-120 
in sampling, 160-161 
Normal equation, 406 
Normal-probability paper, 498 
Norms, centile, 108-109, 113, 503-505 
C-scale, 503-505 
T-scale, 503-505 
Null hypothesis, 180 
definition of, 204 
statistical model for, 204-205 
test of, 185-186 
Numbers, approximate, 29 
limits of, 28 
in measurement, 28-29 
rounding of, 29-30 


Numbers, rules regarding, 29-33 
significant digits in, 30 


Observations, paired, 189 
Ogive, 106-107 

smoothed, 109 
Olds, E. G., 549 

significant rho coefficients, 549 
One-tail test, 207-211, 246 
Origin in coding, 56 


Parameter, population, definition of, 155 
Part-whole correlation, 326-327 
Partial correlation, 316-318 
Pearson, K., correlation coefficient, 285, 369 
computation of, 138-144 
estimated from phi, 331 
formulas for, 138-141, 143, 370 
restriction of range, 319 
Peatman, J. G., 13 
rules for classification, 13 
Percentage, cumulative, 106 
as rate, 15-16 
use of, 42-43 
Percentages, differences between, signifi- 
cance of, 190-192 
Percentile (see Centile) 
Perry, N. C., 303, 308, 331, 550 
cosine-pi r, 308, 550 
point-biserial r, 303, 331 
r estimated from phi, 331 
Peters, C. C., 9, 297 
corrections in r, 329-330 
correlation of correlation coefficients, 193 
correlation ratio, 297 
standard errors, special, 197 
Phi coefficient, 191, 311-315 
and chi square, 312 
derivation of, 511-512 
as estimate of Pearson r, 330-331 
evaluation of, 313-315 
formulas for, 311-312 
maximum, 314 
chart of, 315 
limits to, 313-315 
for responses, 339 
significance of, 313 
Pictograph, 24 
Pie diagram, 22 
Platykurtic distribution, 218 
Point-biserial r, 301-305 
derivation of, 510-511 
evaluation of, 303-305 
formulas for, 302-303 
limits to, 304 
relation to biserial r, 303-304 
Point-biserial r, significance of, 302 
use of, 305 
Poll, public-opinion, 157-159, 213-215 
Population, definition of, 5, 155 
finite, 197 
mean of, 165-167 
Power, of one-tail versus two-tail test, a 
of statistical tests, 189, 217 
Prediction, accuracy of, 333-334 
of attributes, from attributes, 334-340 
from measurements, 340-356 
differential, 432 
of measurements from attributes, 358- 
from measurements, 362-365 
multiple, 390-391, 395-396 
graphic, 396 
from regression equations, 371-372 
and statistics, 6 
types of, 333 

Probability, average, 177 
combination of, 208-209 


combined, of, 245 
of error, 216-217 


model, in statistics, 203-207 

and normal distribution, 118-120 
and null hypothesis, 205 

as proportion, 17 


Proportion, cumulative, 106 

definition of, 16-17 

as a mean, 176-177 

reliability of, 175 

Proportions, difference between, significance of, 190-192 
Psychophysics, coefficient of variation in, 101 


geometric mean in, 72-74 


Quartile, 80-81 
definition of, 80 
graphic estimation, 105 


Range, effect of, on correlation, 318-322 
on reliability, 457-458 
on validity, 322 
as measure of variability, 78-79 
use of, 79, 99-100 
relation to standard deviation, 93 
Rank-difference correlation, 285-288 
evaluation of, 288 
interpretation of, 287-288 
statistical significance of, 288 




Rank order as measurement, 26 
Ratings, analysis of variance of, 278-280 
Reliability, definition of, 146 


44-450 
and length of test, 455-459 
in parts of test scale, 441-442 
of ratings, 280-281, 459 
related to validity, 470-472 
of speed versus power tests, 447 
of statistics, 154-183 
of test battery, 472-473 
test-retest, 442-445 
theory of, 435-442, 449-452 
varieties of, 442-445 


of percentages, 170 


Incidental, 
— 

у of statistics in, 195-197 
principles лы) 


Scatter diagram, 141-144 

Science, mathematics in, 203-204 

Scoring, correction of, for guessing, 479-480 
formulas for, 479-482 


Selection ratio, 381-388 
favorable, 383-384 
Selection tests, effectiveness of, 379-388 


Semi-interquartile range, 78, 80-81 
formula for, 81 
relations to other statistics, 100 
reliability of, 174 
use of, 99-100 
Sequential analysis, 225-226 
Serial correlation, 301 
Shartle, C. L., 411 
Wherry-Doolittle method, 411 
Sheppard, W. F., correction, 96-97, 365 
in correlation, 329 
Sign-rank test, 251-252 
Sign test, 248-249 
Significance, of a deviation, 213 
and sample size, 213-214 
Significance levels, 215-216 
Significance points, 209 
Significant digits, 30 
Significan 


Skewness, 43-44, 67 
and correlation, 150-151 
and quartiles, 81 
and test difficulty, 117 
Smail, L. L., 546, 547 
table, of logarithms, 547-548 
of trigonometric functions, 546 
Small-sample statistics, 217-226 
Snedecor, G. W., 538, 542 
table, of F, 261, 274, 541-542 
of r and t, 538-539 
Sorenson, H., 519 
numerical tables, 519-531 
Spearman, C., rank-difference correlation, 


of an array, 303 

in combined distribution, 509-510 
of a composite, 417-421 
computational check for, 93 
desired, achievement of, 421 

of differences, 418-419 

of error components, 457 
Standard score, 489-494 
definition of, 121-122 
disadvantages of, 492-493 
formula for, 122, 490 
Stanine scale, 503 
definition of, 148 
Statistical inference, 161-162, 215-217 
errors in, 216 
Statistical models, 203-207 
Statistical test, power of, 217 
Statistical tests, combinations of, 244-247 
Statistics, aims of students in, 7-9 
and data, 11-12 
descriptive, 4, 97-99, 154 
distribution-free, 247-254 
need of students for, 1-4 
nonparametric, 247-254 
versus parameters, 155-156 
in research, 2-4 
sampling, 4, 154 
small-sample, 164, 217-226 
Stead, W. H., 411 
Wherry-Doolittle method, 411 
Strong, E. K., Vocational Interest Blank, 
483 
Student's t, 217-221 
Success ratio, 381-388 
Sum of squares, between, definition of, 259 
definition of, 85 
of discrepancies, 361 
formulas for, 88, 264-267, 271-273, 277- 
279 
within, 259-260 


t ratio, for correlation coefficients, 219 
definition of, 218 
degrees of freedom and, 218 
distribution of, 218-219 
formulas for, 219-221, 238 
for point-biserial r, 303 
relation of, to chi, 238 
to chi square, 234 
to F ratio, 264 
T scale, 494-501 
evaluation of, 498-500 
by graphic method, 497-498 
norms, 495-498, 503-505 
t test, following an F test, 263-264 
Tables, preparation of, 18-19 
Tau coefficient, 288 
Taylor, H. C., 380 
selection by tests, 380-385 
Test, battery, heterogeneous, 472 
item statistics, 449-450 
scales, 27-28, 487-503 
work-limit, 74 
Tests, homogeneous versus heterogeneous, 
445-447, 472 
speed versus power, 447 
and statistics, 7 
Tetrachoric r, 305-311 
assumptions for, 305 
equation for, 306 
estimates of, 307-308 
graphic, 308 
limitations to use of, 309-310 
Thorndike, R. L., 321, 374, 404, 412, 427, 
432 
correction for restriction of range, 321 
cutoff method, 427 
multiple correlation, 404 
personnel classification, 432 
regression phenomena, 374 
regression weights, 412, 415 
Thurstone, L. L., 308, 388, 462, 464, 481 
factor theory and methods, 464 
factors, 462 
scoring weights, 481 
tetrachoric r, 308 
Tippett, L. H. C., 157 
random numbers, 157 
Toops, H. A., 429 
cutoff methods, 429 
Transformation, Fisher’s z, 182-183 
table for, 545 
linear, 493 
proof for, 517 
of measurements, 247-248, 493 
Transition zone defined, 474 
Trend chart, 22-23 
Trigonometric functions, table of, 546 
True score defined, 436 
Tucker, L. R., 471 
validity, 471 
Tukey, J. W., 264 
t test following an F test, 264 
Two-tail test, 207-208, 210-211, 214 


Unit in measurement, 27 
Universe defined, 155 


Validity, of biographical data, 484 

coefficient of, 145-146 

use of, in personnel selection, 385 
of composites, 483 
criteria for, 463-464 
determiners of, 470-484 
and errors of measurement, 475-476 
and factor theory, 467-468 
factorial, 462 

of wrongs scores, 482-483 
of interest and temperament tests, 388 


Validity, and item difficulty, 471 
of items, 483-484 
meanings of, 461-464 
practical, 462-464 
related to reliability, 470-472 
of right and wrong responses, 479-483 
and test length, 475 
types of, 461-464 

Van Voorhis, W. R., 9, 297 
correction for correlation coefficients, 330 
correlation of correlation coefficients, 193 
correlation ratio, 297 
standard errors, special, 197 

Variability, absolute, 101 
and correlation, 318-322 
relative, 101 
and reliability, 157-158, 457-458 

Variable, definition of, 12 
dependent and independent, 367, 390 
suppression, 403 

Variables, contributions of, in prediction, 

397 
Variance, analysis of (see Analysis of vari- 
ance) 
between, definition of, 259 
components of, 379 
in test, 437, 465-466 
of a composite, 419-421 
definition of, 85 
error, contributions to, 443-445 
definition of, 270, 436 
geometry of, 86-87, 419 
interaction, definition of, 269 
interpretation of, 88 
population, estimates of, 258-259 
predicted and nonpredicted, 379 
residual, definitions of, 269 
of a sum, 419-421 
of a test item, 449 
true and error, 436, 443-446 
between two means, 264 
within, 259-260, 361 
Variances, homogeneity, test for, 242-244 
Variation, coefficient of, 101-102 
use of, 102 
sources of, removed, 275-277 


Wald, A., 226 
sequential analysis, 225-226 
Walker, H. M., 10, 217, 235, 308, 538 
exact probabilities, 235 
nonparametric tests, 254 
point-biserial r, 303 
statistical inference, 217 
Student's t distribution, 246 
Wallace, H. A., 538 
statistical table, 538-539 
Weber's law and variability, 101 
Weights, optimal, 394 
substitutes for, 422-423 
principles for, 424 
Wherry, R. J., 433 
multiple point-biserial correlation, 433 
Wherry-Doolittle method, 411-412 
Whitney, D. R., 254 
Mann-Whitney U test, 254 
Wickert, F., 21 
Wilcoxon, F., 251, 553-554 
nonparametric tests, 251-253 
table, of R, 554 
of T, 553 
Wilkinson, B., 246 
combined statistical tests, 245 
Woodworth, 474 


Yates, F., correction, 234 
Yule, G. U., 311 
phi coefficient, 311 


z, Fisher's, 182-183 
table of, 545 
use of, in averaging r’s, 325-326 
Zero in measurement, 27 
Zimmerman, W. S., Aptitude Survey, 152, 
503-504 


